Preface


I am indebted to the many readers and colleagues who have written to me on several occasions with
encouraging feedback on my earlier textbook Fundamentals of Adaptive Filtering (Wiley, NJ, 2003). 
Their enthusiastic comments encouraged me to pursue this second project in an effort to create a 
revised version for teaching purposes. During this exercise, I decided to remove some advanced 
material and move select topics to the problems. I also opted to fundamentally restructure the entire 
text into eleven consecutive parts with each part consisting of a series of focused lectures and ending 
with bibliographic comments, problems, and computer projects. I believe this restructuring into 
a sequence of lectures will provide readers and instructors with more flexibility in designing and 
managing their courses. I also collected most background material on random variables and linear 
algebra into three chapters at the beginning of the book. Students and readers have found this material 
of independent interest in its own right. At the same time, I decided to maintain the same general style 
and features of the earlier publication in terms of presentation and exposition, motivation, problems, 
computer projects, summary, and bibliographic notes. These features have been well received by our 
readers. 


AREA OF STUDY 


Adaptive filtering is a topic of immense practical relevance and deep theoretical challenges that 
persist even to this date. There are several notable texts on the subject that describe many of the 
features that have fascinated students and researchers over the years. In this textbook, we choose
to step back and to take a broad look at the field. In so doing, we feel that we are able to bring 
forth, to the benefit of the reader, extensive commonalities that exist among different classes of 
adaptive algorithms and even among different filtering theories. We are also able to provide a uniform 
treatment of the subject in a manner that addresses some existing limitations, provides additional 
insights, and allows for extensions of current theory. 

We do not have any illusions about the difficulties that arise in any attempt at understanding 
adaptive filters more fully. This is because adaptive filters are, by design, time-variant, nonlinear, 
and stochastic systems. Any one of these qualifications alone would have resulted in a formidable 
system to study. Put them together and you face an almost impossible task. It is no wonder then that 
current practice tends to study different adaptive schemes separately, with techniques and
assumptions that are usually more suitable for one adaptation form over another. It is also no surprise that
most treatments of adaptive filters, including the one adopted in this textbook, need to rely on some 
simplifying assumptions in order to make filter analysis and design a more tractable objective. 

Still, in our view, three desirable features of any study of adaptive filters would be (1) to attempt 
to keep the number of simplifying assumptions to a minimum, (2) to delay their use until necessary, 
and (3) to apply similar assumptions uniformly across different classes of adaptive algorithms. This 
last feature enables us to evaluate and compare the performance of adaptive schemes under similar 
assumptions on the data, while delaying the use of assumptions enables us to extract the most
information possible about actual filter performance. In this book, we pay particular
attention to these three features throughout the presentation.

In addition, we share the conviction that a thorough understanding of the performance and
limitations of adaptive filters requires a solid grasp of the fundamentals of least-mean-squares estimation
theory. These fundamentals help the designer understand what it is that an adaptive filter is trying
to accomplish and how well it performs in this regard. For this reason, Parts I (Optimal Estimation)
and II (Linear Estimation) of the book are designed to provide the reader with a self-contained and
easy-to-follow exposition of estimation theory, with a focus on topics that are relevant to the subject 
matter of the book. In these initial parts, special emphasis is placed on geometric interpretations of 
several fundamental results. The reader is advised to pay close attention to these interpretations since 
it will become clear, time and again, that cumbersome algebraic manipulations can often be
simplified by recourse to geometric constructions. These constructions not only provide a more lasting
appreciation for the results of the book, but they also expose the reader to powerful tools that can be
useful in contexts other than adaptive filtering and estimation theory.

The reader is further advised to master the convenience of the vector notation, which is used 
extensively throughout this book. Besides allowing a compact exposition of ideas and a compact 
representation of results, the vector notation also allows us to exploit to great effect several important 
results from linear algebra and matrix theory and to capture, in elegant ways, many revealing
characteristics of adaptive filters. We cannot emphasize strongly enough the importance of linear algebraic
and matrix tools in our presentation, as well as the elegance that they bring to the subject. The
combined power of the geometric point of view and the vector notation is perhaps best exemplified by our
detailed treatment later in this book of least-squares theory and its algorithmic variants. Of course, 
the reader is exposed to geometric and vector formulations in the early chapters of the book. 


STRUCTURE OF THE BOOK 


The book is divided into eleven core parts, in addition to a leading part on Background Material and 
a trailing part on References and Indices. Table P.1 lists the various parts. Each of the core parts, 
numbered I through XI, consists of four distinctive elements in the following order: (i) a series of
lectures where the concepts are introduced, (ii) a summary of all lectures combined, (iii) bibliographic
commentary, and (iv) problems and computer projects. 


Lectures and Concepts. In the early parts of the book, each concept is motivated from first
principles, starting from the obvious and ending with the more advanced. We follow this route of
presentation until the reader develops enough maturity in the field. As the book progresses, we
expect the reader to become more sophisticated and, therefore, we cut back on the "obvious."


Summaries. For ease of reference, at the end of each part, we collect a summary of the key
concepts and results introduced in the respective lectures.


Bibliographic Commentaries. In the remarks at the end of each part we provide a wealth of 
references on the main contributors to the results discussed in the respective lectures. Rather than 
scatter references throughout the lectures, we find it useful to collect all references at the end of the 
part in the form of a narrative. We believe that this way of presentation gives the reader a more 
focused perspective on how the references and the contributions relate to each other both in time and 
context. 


Problems. The book contains a significant number of problems, some more challenging than others 
and some more applied than others. The problems should be viewed as an integral part of the text, 
especially since additional results appear in them. It is for this reason, and also for the benefit of the 
reader, that we have chosen to formulate and design most problems in a guided manner. Usually, 
and especially in the more challenging cases, a problem starts by stating its objective followed by 
a sequence of guided steps until the final answer is attained. In most cases, the answer to each 
step appears stated in the body of the problem. In this way, a reader would know what the answer 
should be, even if the reader fails to solve the problem. Thus rather than ask the reader to "find an
expression for x," we would generally ask instead to "show that x is given by x = ..." and then
give the expression for x.


All instructors can request copies of a free solutions manual from the publisher. 


Moreover, several problems in the book have been designed to introduce readers to useful topics from 
related fields, such as multi-antenna receivers, cyclic-prefixing, maximal ratio combining, OFDM
receivers, and so forth. Students are usually surprised to learn how classical concepts and ideas form
the underpinnings of seemingly advanced techniques. 


Computer Projects. We have included several computer projects (see the listing in Table P.2) to 
show students, and also practitioners, how the results developed in the book can be useful in situations 
of practical interest (e.g., linear equalization, decision feedback equalization, channel estimation, 
beamforming, tracking fading channels, line echo cancellation, acoustic echo cancellation, active 
noise control, OFDM receivers). In designing these projects, we have made an effort at choosing 
topics that are relevant to practitioners. We have also made an effort at illustrating to students how 
a solid theoretical understanding can guide them in challenging situations. MATLAB¹ programs
are available for solving all computer projects in the book, in addition to a solutions manual. The 
programs are offered without any guarantees. While we have found them to be effective for the 
instructional purposes of this textbook, the programs are not intended to be examples of full-blown 
or optimized designs; practitioners should use them at their own risk. For example, in order to
keep the codes at a level that is easy to understand by students, we have often decided to sacrifice
performance in favor of simplicity.


MATLAB programs that solve all computer projects in the book, in addition to a solutions
manual for the projects with extensive commentary and typical performance plots, can be
downloaded for free by all readers (including students and instructors) from either the
publisher's website or the author's website.


Background Material. We provide three self-contained chapters that explain all the required
background material on random variables and linear algebra for the purposes of this book. Actually, after
progressing sufficiently far in the book, students will be able to master many useful concepts
from linear algebra and matrix theory, in addition to adaptive filtering.


COVERAGE AND TOPICS 


The material in the book can be categorized into five broad areas of study (A through E), as listed 
in Table P.3. Area A covers the fundamentals of least-mean-squares estimation theory with several 
application examples. Areas B and C deal mainly with LMS-type adaptive filters, while areas D and 
E deal with least-squares-type adaptive filters. If an instructor wishes to focus mostly on LMS-type 
filters, then the instructor can do so by covering only material from within areas B and C. Even in this 
case, students will still be exposed to the recursive-least-squares (RLS) algorithm and its performance 
results from the discussions in Chapter 14 and Area C. However, for a more in-depth treatment of
RLS and its many variants, instructors will need to select chapters from within Area D as well. 


DEPENDENCIES AMONG THE CORE PARTS 


Figure P.1 illustrates the dependencies among the eleven core parts in the book. In the figure, the 
material in a part that is at the receiving end of an arrow requires some (but not necessarily all) of 
the material from the part at the origin of the arrow. A dashed arrow indicates that the dependency 
between the respective parts is weak and, if desired, the parts can be covered independently of each 
other. For example, in order to cover Part III (Stochastic Gradient Methods), the instructor would
need to cover Part II (Linear Estimation). The material in Part I (Optimal Estimation) is not necessary
for Part II (Linear Estimation) but it is useful for a better understanding of it. Figure P.1 can be


¹ MATLAB is a registered trademark of the MathWorks Inc., 24 Prime Park Way, Natick, MA 01760-1500,
http://www.mathworks.com.
TABLE P.1 A breakdown of the book structure into eleven core parts.

Parts                               Chapters

Background Material                 A. Random Variables
                                    B. Linear Algebra
                                    C. Complex Gradients

I. Optimal Estimation               1. Scalar-Valued Data
                                    2. Vector-Valued Data

II. Linear Estimation               3. Normal Equations
                                    4. Orthogonality Principle
                                    5. Linear Models
                                    6. Constrained Estimation
                                    7. Kalman Filter

III. Stochastic Gradient Methods    8. Steepest-Descent Technique
                                    9. Transient Behavior
                                    10. LMS Algorithm
                                    11. Normalized LMS Algorithm
                                    12. Other LMS-Type Algorithms
                                    13. Affine Projection Algorithm
                                    14. RLS Algorithm

IV. Mean-Square Performance         15. Energy Conservation
                                    16. Performance of LMS
                                    17. Performance of NLMS
                                    18. Performance of Sign-Error LMS
                                    19. Performance of RLS and Other Filters
                                    20. Nonstationary Environments
                                    21. Tracking Performance

V. Transient Performance            22. Weighted Energy Conservation
                                    23. LMS with Gaussian Regressors
                                    24. LMS with Non-Gaussian Regressors
                                    25. Data-Normalized Filters

VI. Block Adaptive Filters          26. Transform-Domain Adaptive Filters
                                    27. Efficient Block Convolution
                                    28. Block and Subband Adaptive Filters

VII. Least-Squares Methods          29. Least-Squares Criterion
                                    30. Recursive Least-Squares
                                    31. Kalman Filtering and RLS
                                    32. Order and Time-Update Relations

VIII. Array Algorithms              33. Norm and Angle Preservation
                                    34. Unitary Transformations
                                    35. QR and Inverse QR Algorithms

IX. Fast RLS Algorithms             36. Hyperbolic Rotations
                                    37. Fast Array Algorithm
                                    38. Regularized Prediction Problems
                                    39. Fast Fixed-Order Filters

X. Lattice Filters                  40. Three Basic Estimation Problems
                                    41. Lattice Filter Algorithms
                                    42. Error-Feedback Lattice Filters
                                    43. Array Lattice Filters

XI. Robust Filters                  44. Indefinite Least-Squares
                                    45. Robust Adaptive Filters
                                    46. Robustness Properties

References and Indices


TABLE P.2 A listing of all computer projects in the book. MATLAB programs that solve
these projects can be downloaded by all readers from the publisher's or author's websites, in
addition to a solutions manual.

Computer project    Topic

I.1                 Comparing optimal and suboptimal estimators
II.1                Linear equalization and decision devices
II.2                Beamforming
II.3                Decision-feedback equalization
III.1               Constant-modulus criterion
III.2               Constant-modulus algorithm
III.3               Adaptive channel equalization
III.4               Blind adaptive equalization
IV.1                Line echo cancellation
IV.2                Tracking Rayleigh fading channels
V.1                 Transient behavior of LMS
VI.1                Acoustic echo cancellation
VII.1               OFDM receiver
                    Tracking Rayleigh fading channels
                    Performance of array implementations in finite precision
                    Stability issues in fast least-squares
                    Performance of lattice filters in finite precision
                    Active noise control


TABLE P.3 A breakdown of the book structure into five broad topic areas.

Category                            Parts

A. Introduction and Foundations     Part I: Optimal Estimation
                                    Part II: Linear Estimation

B. Stochastic-Gradient Methods      Part III: Stochastic-Gradient Methods
                                    Part VI: Block Adaptive Filters

C. Performance Analyses             Part IV: Mean-Square Performance
                                    Part V: Transient Performance

D. Least-Squares Methods            Part VII: Least-Squares Methods
                                    Part VIII: Array Algorithms
                                    Part IX: Fast RLS Algorithms
                                    Part X: Lattice Filters

E. Indefinite Least-Squares         Part XI: Robust Filters


used by instructors to design different course sequences according to their needs and interests. For 
example, if the instructor is interested in covering only LMS-type adaptive filters and in studying 
their performance, then one possibility is to cover material from within Parts II, III, IV, and V. 


AUDIENCE 


The book is intended for a graduate-level course on adaptive filtering. Although it is beneficial that 
students have some familiarity with basic concepts from matrix theory, linear algebra, and random 
variables, the book includes three chapters on background material in these areas. The review is 
done in a motivated manner and is tailored to the needs of the presentation. From our experience,
these reviews are sufficient for a thorough understanding of the discussions in the book. In addition,
several of the problems reinforce the linear algebraic and matrix concepts, so much so that students
will get valuable training in linear algebra and matrix theory, in addition to adaptive filtering, from
reading (and understanding) this book.


FIGURE P.1 Dependencies among the chapters. Instructors can design different course
sequences in accordance with their needs and interests.


The book is also intended to be a reference for researchers, which explains why we have chosen 
to include some advanced topics in a handful of places. As a result, the book contains ample material 
for instructors to design courses according to their interests. Clearly, we do not expect instructors to 
cover all the material in the book in a typical course offering; such an objective would be
counterproductive. In our own teaching of the material, we instead focus on some key sections and chapters
and request that students complement the discussions by means of reading and problem solving. As
explained below, several key sections have been designed to convey the main concepts, while the
remaining sections tend to include more advanced material and also illustrative examples. Once
students understand the basic principles, you will be amazed at how well they can follow the other 
lectures on their own and even solve the pertinent problems. 

To facilitate course planning, Table P.4 lists the key chapters or sections from the various core 
parts of the book for both lecturing and reading purposes. For example, Part V (Transient
Performance) studies the transient behavior of a large family of adaptive filters in a uniform manner. The
main idea is captured by the transient analysis of the LMS algorithm in Chapters 23 and 24; these 
chapters rely on the machinery developed in Chapter 22. Once students understand the framework as 
applied to LMS, they will be able to study the transient analysis of other filters on their own. This is
one key advantage of adopting and emphasizing a uniform treatment of adaptive filter performance
throughout our presentation. Similar remarks hold for the steady-state and tracking analyses of Part
IV (Mean-Square Performance). It is sufficient to illustrate how the methodology applies to the
special case of LMS, for example, by covering Chapters 15 and 16, as well as Sec. 21.1. Students would
then do well in studying the extensions on their own if desired.


TABLE P.4 A suggested list of key chapters and sections. 


Part         Key chapters for lecturing             Key chapters for reading

Part I       Chapters 1 and 2
Part II      Chapters 3, 4                          Sections 5.2, 5.3, 5.5, 6.3, 6.5
             Sections 5.1, 5.4, 6.1, 6.2, 6.4       Chapter 7
Part III     Chapters 8, 9, 10, 11, 14              Chapters 12, 13
Part IV      Chapters 15, 16, 20, 21.1              Chapters 17, 18, 19, 21.2-21.8
Part V       Chapters 22, 23, 24                    Chapter 25
Part VI      Chapters 26, 27, 28
Part VII     Chapters 29, 30                        Chapters 31, 32
Part VIII    Chapters 33, 34, 35
Part IX      Chapters 36, 37                        Chapters 38, 39
Part X       Chapters 40, 41                        Chapters 42, 43
Part XI      Chapters 44, 45                        Chapter 46


SOME FEATURES OF OUR TREATMENT 


There are some distinctive features in our treatment of adaptive filtering. Among other elements, 
experts will be able to notice the following contributions: 
(a) We treat a large variety of adaptive algorithms. 


(b) Parts IV and V study the mean-square performance of adaptive filters by resorting to
energy-conservation arguments. While the performance of different adaptive filters is usually studied
separately in the literature, the framework adopted in this book applies uniformly across
different classes of adaptive filters. In addition, the same framework is used for steady-state
analysis, transient analysis, tracking analysis, and robustness analysis (in Part XI).


(c) Part VI studies block adaptive filters, and the related class of subband adaptive filters, in a
manner that clarifies the connections between these two families more directly than prior 
treatments. Our presentation also indicates how to move beyond DFT-based transforms and 
how to use other classes of orthogonal transforms for block adaptive filtering (as explained in 
Chapter 10 of Sayed (2003)). 


(d) Parts VII-IX provide a detailed treatment of least-squares adaptive filters that is distinct from
prevailing approaches in a handful of respects. First, we focus on regularized least-squares
problems from the outset and take the regularization factor into account in all derivations.
Second, we insist on deriving time- and order-update relations independent of any structure
in the regression data (e.g., we do not require the regressors to arise from a tapped-delay-line
implementation). In this way, one can pursue efficient least-squares filtering even for some
non-FIR structures (as explained in Chapter 16 of Sayed (2003)). Third, we emphasize the
role and benefits of array-based schemes. And, finally, we highlight the role of geometric
constructions and the insights they bring into least-squares theory.


(e) Part XI develops the theory of robust adaptive filters by studying indefinite least-squares
problems and by relying on energy arguments as well. In the process, the robustness and optimality
properties of several adaptive filters are clarified. The presentation in this part is developed 
in a manner that parallels our treatment of least-squares problems in Chapters 29 and 30 so
[...] least-squares versus indefinite least-squares).


NOTATION


The objective of this book is not limited to providing a treatment of adaptive filters, but also to
bring forth connections between adaptive filtering and other filtering theories. To do so, it becomes 
necessary to adopt a notation that captures with relative ease the similarities and connections that 
exist among different filtering theories. One main reason behind our choice of notation is that in our 
treatment of adaptive filters we need to distinguish between four types of variables: 


random, scalar, vector, and matrix variables 


While in many treatments of adaptive filters, no notational distinction is usually made between
random quantities and their realizations, it is nevertheless important in our treatment of the subject to
distinguish between the stochastic and nonstochastic domains. This distinction allows us, among 
other results, to describe with more transparency the connections that exist between filters that are 
derived in the stochastic domain (e.g., Kalman filters) and filters that are motivated by working with 
signal realizations (e.g., least-squares filters). 

Once the reader becomes familiar with our convention, it will be straightforward to deduce the 
nature of a variable appearing in an equation by simply recalling the rules listed below. Whenever 
exceptions to these rules are used in the text, they will be obvious from the context. In some rare 
instances, rather than insist on following a strict convention, we may opt to relax our notation in a 
manner that best suits the discussion at hand. In general, however, the notation is consistent and 
motivated, and it should not be a hurdle to any attentive reader. 

The following is a list of the notational conventions used in the textbook: 

(a) We use boldface letters to denote random variables and normal font letters to denote their
realizations (i.e., deterministic values), like $\boldsymbol{x}$ and $x$, respectively. In other words, we reserve
the boldface notation for random quantities.


(b) We use CAPITAL LETTERS for matrices and small letters for both vectors and scalars, for
example, $X$ and $x$, respectively. In view of the first convention, $\boldsymbol{X}$ would denote a matrix with
random entries, while $X$ would denote a matrix realization (i.e., a matrix with deterministic
values). Likewise, $\boldsymbol{x}$ would denote a vector with random entries, while $x$ would denote a
vector realization (or a vector with deterministic values). One notable exception to the capital
letter notation is the use of such letters to refer to filter orders and to the total number of data
points. For example, we usually write $M$ to denote the number of taps in a filter and $N$ to
denote the total number of observations. These exceptions will be obvious from the context.


(c) We use parentheses to denote the time dependency of a scalar quantity and subscripts to
denote the time dependency of a vector or matrix quantity; for example, if $d$ is a scalar then
$d(i)$ refers to its value at time (or iteration) $i$. On the other hand, if $u$ is a vector, then $u_i$
denotes its value at time (or iteration) $i$. Thus by looking at $d(i)$ and $u_i$, it is easy
to tell which one is a scalar and which one is a vector: $d(i)$ is a scalar
and $u_i$ is a vector. The time-dependency notation is useful to distinguish between scalars and
vectors.


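(d) We use the superscript T to denote transposition and the superscript * to denote transposition
with complex conjugation; for example, if

$$A = \begin{bmatrix} \alpha & \beta \\ \gamma & \delta \end{bmatrix}$$

then

$$A^{T} = \begin{bmatrix} \alpha & \gamma \\ \beta & \delta \end{bmatrix}, \qquad A^{*} = \begin{bmatrix} \alpha^{*} & \gamma^{*} \\ \beta^{*} & \delta^{*} \end{bmatrix}$$
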
(e) All vectors in our presentation are column vectors with one notable exception (starting from
Chapter 10 onwards): we choose to represent the regression vector $u_i$ as a row vector.
Although this convention for the regressor is not essential, we have found it to be convenient for
the following reasons:


(e.1) We shall frequently encounter in our discussions the inner product between the regressor
$u_i$ and some weight column vector $w$. With $u_i$ defined as a row vector, we can simply write
the inner product as $u_i w$ without the need for a transposition symbol.


(e.2) Usually, the regressor $u_i$ arises as the state vector of a tapped delay line, as is shown in
Fig. N.1 for a finite-impulse-response channel, say, of order $M$ and weight vector $w$,

$$u_i = \begin{bmatrix} u(i) & u(i-1) & \cdots & u(i-M+1) \end{bmatrix} \qquad \text{(adopted convention)}$$


FIGURE N.1 A tapped delay line structure. The state of the channel is defined as the vector
that contains the entries at the outputs of the delay elements, in addition to the input signal.


With $u_i$ defined as a row vector, the noisy output of the channel will be described by

$$d(i) = u_i w + v(i)$$

Had we defined $u_i$ as a column vector instead, say, as

$$u_i = \begin{bmatrix} u(i) \\ u(i-1) \\ \vdots \\ u(i-M+1) \end{bmatrix}$$

then the output of the channel would have been described by

$$d(i) = u_i^{T} w + v(i)$$

with the transposition symbol used. Of course, the use of the transposition symbol is not
of any consequence in its own right, especially when dealing with real data. However,
we shall often deal with complex data (e.g., in channel equalization applications). In
such situations, in addition to $u_i^{T}$, we shall also encounter terms involving $u_i^{*}$ in our
developments such as when defining the covariance matrix of the regression data. In
this way, both transposition superscripts, * and T, will appear in the discussions. We
prefer to stick to a single symbol for transposition. By defining $u_i$ as a row vector,
we avoid such situations and use almost exclusively the complex-conjugation symbol *
throughout the presentation in the book.


(e.3) Alternatively, we could have defined the regressor as the following column vector:

$$u_i = \begin{bmatrix} u^{*}(i) \\ u^{*}(i-1) \\ \vdots \\ u^{*}(i-M+1) \end{bmatrix}$$

with conjugated entries $\{u^{*}(j)\}$ instead of the actual entries $\{u(j)\}$. In this case, the
output of the channel would have been described by

$$d(i) = u_i^{*} w + v(i)$$

for both cases of real and complex data and, therefore, only terms involving $u_i^{*}$ will
appear in our discussions (and not both $u_i^{*}$ and $u_i^{T}$). Nevertheless, if we define $u_i$ as
above, then the entries of $u_i$ will not relate directly to the data stored in the channel (i.e.,
to the state of the channel) but to their conjugate values, and the designer will need to
keep this fact in mind during simulation and algorithm coding.


(e.4) Finally, the notation $u_i w$ is convenient for MATLAB simulations of adaptive filters
(e.g., in the computer projects used throughout the book). The designer would need to
be more careful in coding $u_i^{*} w$ as opposed to $u_i w$.


For these reasons, we have opted to define the regressor $u_i$ as a row vector. All other vectors
in our treatment are column vectors. In our teaching of the subject over the years, we have 
found that students adapt very well to this convention. We have made every effort to make the 
notation consistent and coherent for the benefit of the reader. Please excuse imperfections. 
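
As a rough illustration of the regressor conventions discussed in item (e), the sketch below forms the channel output for complex data in three ways: with a row regressor, with a column regressor and the transposition symbol, and with a column of conjugated entries and the symbol *. It is written in plain Python rather than MATLAB and is not taken from the book's programs; the helper names `row_times_col` and `conj`, the filter length, and all sample values are assumptions made only for this example.

```python
# Illustrative sketch only: compare regressor conventions for complex data.
# Helper names and numerical values are hypothetical, chosen for this example.

def row_times_col(row, col):
    """Plain product of a row vector and a column vector (no conjugation)."""
    assert len(row) == len(col)
    return sum(r * c for r, c in zip(row, col))

def conj(vec):
    """Entry-wise complex conjugate of a list of numbers."""
    return [complex(v).conjugate() for v in vec]

# Tapped-delay-line state [u(i), u(i-1), u(i-2)] and weight vector w (M = 3).
state = [1 + 2j, -1j, 0.5 + 0j]
w = [0.5 - 1j, 2 + 0j, 1j]
v = 0.0  # noise sample v(i), set to zero so the comparison is exact

# Adopted convention: u_i is a row vector, so d(i) = u_i w + v(i).
u_row = state
d_row = row_times_col(u_row, w) + v

# Column convention: same entries, so d(i) = u_i^T w + v(i).
# For flat lists, transposition leaves the entries unchanged.
u_col = state
d_col_T = row_times_col(u_col, w) + v

# Conjugated-column convention: u_i stores u*(j), so d(i) = u_i^* w + v(i);
# the * conjugates once more and recovers the original entries.
u_col_conj = conj(state)
d_col_star = row_times_col(conj(u_col_conj), w) + v

# Pitfall: applying * to the unconjugated column gives a different number.
d_wrong = row_times_col(conj(u_col), w) + v

print(d_row, d_col_T, d_col_star, d_wrong)
```

The first three quantities agree, while the last one does not, which is the kind of coding slip that the row-vector convention is meant to make less likely when working with complex data.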
SYMBOLS


We collect here, for ease of reference, a list of the main symbols used throughout the text. 


$\mathbb{R}$                set of real numbers
$\mathbb{R}^{+}$            set of positive real numbers
$\cdot^{T}$                 matrix transposition
$\cdot^{*}$                 complex conjugation for scalars; Hermitian transposition for matrices
$\boldsymbol{x}$            boldface letter denotes a random scalar or vector variable
$\boldsymbol{X}$            boldface capital letter denotes a random matrix
$x$                         letter in normal font denotes a vector or a scalar in Euclidean space
$X$                         capital letter in normal font denotes a matrix in Euclidean space
$\mathsf{E}\,\boldsymbol{x}$    expected value of the random variable $\boldsymbol{x}$
$\boldsymbol{x} \perp \boldsymbol{y}$    orthogonal random variables $\boldsymbol{x}$ and $\boldsymbol{y}$ (i.e., $\mathsf{E}\,\boldsymbol{x}\boldsymbol{y}^{*} = 0$)
$x \perp y$                 orthogonal vectors $x$ and $y$ (i.e., $x^{*}y = 0$)
$\|x\|^{2}$                 $x^{*}x$ for a column vector $x$; squared Euclidean norm of $x$
$\|x\|$                     $\sqrt{x^{*}x}$ for a column vector $x$; Euclidean norm of $x$
$\|x\|_{W}^{2}$             $x^{*}Wx$ for a column vector $x$ and positive-definite matrix $W$
$\|A\|$                     maximum singular value of $A$; also the spectral norm of $A$
$\|A\|_{F}$                 Frobenius norm of $A$
$a \triangleq b$            quantity $a$ is defined as $b$
Re($x$)                     real part of $x$
Im($x$)                     imaginary part of $x$
0                           zero scalar, vector, or matrix
$I_{n}$                     identity matrix of size $n \times n$
col{$a$, $b$}               column vector with entries $a$ and $b$
vec($A$)                    column vector formed by stacking the columns of $A$
vec$^{-1}$($a$)             square matrix formed by unstacking its columns from $a$
diag{$A$}                   column vector with the diagonal entries of $A$
diag{$a$}                   diagonal matrix with entries read from the column $a$
diag{$a$, $b$}              diagonal matrix with diagonal entries $a$ and $b$
$a \oplus b$                same as diag{$a$, $b$}
$A \otimes B$               Kronecker product of $A$ and $B$
$A^{\dagger}$               pseudo-inverse of $A$
$P > 0$                     positive-definite matrix $P$
$P \geq 0$                  positive-semidefinite matrix $P$
$P^{1/2}$                   square-root factor of a matrix $P > 0$, usually lower triangular
$A > B$                     means that $A - B$ is positive-definite
$A \geq B$                  means that $A - B$ is positive-semidefinite
det $A$                     determinant of the matrix $A$
Tr($A$)                     trace of the matrix $A$
$O(M)$                      constant multiple of $M$, or of the order of $M$
log $a$                     logarithm of $a$ relative to base 10
ln $a$                      natural logarithm of $a$
                            end of a theorem/lemma/proof/example/remark
                            theorem
                            definition
                            problem
                            example
                            figure
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NOTATION 
QR factorization of a matrix 
lower-diagonal-upper decomposition of a matrix 
upper-diagonal-lower decomposition of a matrix 
LDU decomposition of a Hermitian matrix 
UDL decomposition of a Hermitian matrix 
linear time-invariant 
linear least-mean-squares estimation/estimator 
minimum mean-square error 
independent and identically distributed 
probability density function 
almost everywhere 
autoregressive model 
moving average mode! 
autoregressive moving average mode! 
finite impulse response filter 
infinite impulse response filter 
signal to noise ratio 
unit-time delay operator 
bilateral z-transform of a scalar sequence (z(i)) 
discrete-time Fourier transform of (z(i)) 
bilateral z-transform of a vector sequence (z(i)) 
l.l.m.s. estimator of $x_i$ given observations $\{y_m\}$ up to time $j$
l.l.m.s. estimator of $x_i$ given observations $\{y_m\}$ up to time $i-1$
estimation error $x_i - \hat{x}_{i|j}$
estimation error $x_i - \hat{x}_i$
weight estimate at time or iteration i (a column vector)
weight error vector at time or iteration i (a column vector)
regressor at time or iteration i (a row vector)
output estimation error at time or iteration i
a priori estimation error at time or iteration i
a posteriori estimation error at time or iteration i 
covariance matrix of the regression data 
noise variance 
value of a scalar variable d at time or iteration i
value of a vector variable d at time or iteration i
CHAPTER A


Random Variables 


The presentation in the book relies on some basic concepts from probability theory and random 
variables. For the benefit of the reader, we shall motivate these concepts whenever needed as well 
as highlight their relevance in the estimation context. In this way, readers will be introduced to 
the necessary concepts in a gradual and motivated manner, and they will come to appreciate their 
significance away from unnecessary abstractions. In this initial chapter, we collect several concepts 
of general interest. These concepts complement well the material in future chapters and will be called 
upon at different stages of the discussion. 


A.1 VARIANCE OF A RANDOM VARIABLE 


We initiate our presentation with an intuitive explanation for what the variance of a random variable 
means. The explanation will help the reader appreciate the value of the least-mean-squares criterion, 
which is used extensively in later chapters. 

Consider a scalar real-valued random variable $\mathbf{x}$ with mean value $\bar{x}$ and variance $\sigma_x^2$, i.e.,


$$\bar{x} = \mathrm{E}\,\mathbf{x}, \qquad \sigma_x^2 = \mathrm{E}\,(\mathbf{x}-\bar{x})^2 = \mathrm{E}\,\mathbf{x}^2 - \bar{x}^2 \qquad \text{(A.1)}$$


where the symbol $\mathrm{E}$ denotes the expectation operator. Observe that we are using boldface letters
to denote random variables, which will be our convention in this book. When $\mathbf{x}$ has zero mean,
its variance is simply $\sigma_x^2 = \mathrm{E}\,\mathbf{x}^2$. Intuitively, the variance of $\mathbf{x}$ defines an interval on the real axis
around $\bar{x}$ where the values of $\mathbf{x}$ are most likely to occur:


1. A small $\sigma_x^2$ indicates that $\mathbf{x}$ is more likely to assume values that are close to its mean, $\bar{x}$.


2. A large $\sigma_x^2$ indicates that $\mathbf{x}$ can assume values over a wider interval around its mean.


For this reason, it is customary to regard the variance of a random variable as a measure of uncertainty 
about the value it can assume in a given experiment. A small variance indicates that we are more 
certain about what values to expect for x (namely, values that are close to its mean), while a large 
variance indicates that we are less certain about what values to expect. These two situations are 
illustrated in Figs. A.1 and A.2 for two different probability density functions. 

Figure A.1 plots the probability density function (pdf) of a Gaussian random variable $\mathbf{x}$ for two
different variances. In both cases, the mean of the random variable is fixed at $\bar{x} = 20$ while the
variance is $\sigma_x^2 = 225$ in one case and $\sigma_x^2 = 4$ in the other. Recall that the pdf of a Gaussian random
variable is defined in terms of $(\bar{x}, \sigma_x^2)$ by the expression


$$f_{\mathbf{x}}(x) = \frac{1}{\sqrt{2\pi}\,\sigma_x}\, e^{-(x-\bar{x})^2/2\sigma_x^2}, \qquad x \in (-\infty, \infty) \qquad \text{(A.2)}$$


FIGURE A.1 Two Gaussian distributions. The figure shows the plots of the probability density functions of a Gaussian random variable $\mathbf{x}$
with mean $\bar{x} = 20$, variance $\sigma_x^2 = 225$ in the top plot, and variance $\sigma_x^2 = 4$ in the bottom plot.


where $\sigma_x$ is called the standard deviation of $\mathbf{x}$. Recall further that the pdf of a random variable is
useful in several respects. In particular, it allows us to evaluate probabilities of events of the form


$$P(a \le \mathbf{x} \le b) = \int_a^b f_{\mathbf{x}}(x)\,dx$$


i.e., the probability of $\mathbf{x}$ assuming values inside the interval $[a, b]$. From Fig. A.1 we observe that the
smaller the variance of $\mathbf{x}$, the more concentrated its pdf is around its mean.

Figure A.2 provides similar plots for a random variable with a Rayleigh distribution, namely,
with a pdf given by


$$f_{\mathbf{x}}(x) = \frac{x}{\sigma^2}\, e^{-x^2/2\sigma^2}, \qquad x \ge 0 \qquad \text{(A.3)}$$


where $\sigma$ is a positive parameter that determines the mean and variance of $\mathbf{x}$ according to the expressions (see Prob. I.1):

$$\bar{x} = \sigma\sqrt{\frac{\pi}{2}}, \qquad \sigma_x^2 = \left(2 - \frac{\pi}{2}\right)\sigma^2 \qquad \text{(A.4)}$$

Observe, in particular, and in contrast to the Gaussian case, that the mean and variance of a Rayleigh-distributed
random variable cannot be chosen independently of each other since they are linked
through the parameter $\sigma$. In Fig. A.2, the top plot corresponds to $\bar{x} = 1$ and $\sigma_x^2 = 0.2732$, while the
bottom plot corresponds to $\bar{x} = 3$ and $\sigma_x^2 = 2.4592$.
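
The link between the mean and the variance of a Rayleigh variable can be checked numerically. The following is a minimal sketch, assuming the standard Rayleigh expressions $\bar{x} = \sigma\sqrt{\pi/2}$ and $\sigma_x^2 = (2 - \pi/2)\sigma^2$; the function names `rayleigh_moments` and `sigma_for_mean` are illustrative and do not come from the text.

```python
import math

def rayleigh_moments(sigma):
    """Mean and variance of a Rayleigh variable with parameter sigma > 0."""
    mean = sigma * math.sqrt(math.pi / 2)
    variance = (2 - math.pi / 2) * sigma ** 2
    return mean, variance

def sigma_for_mean(mean):
    """Parameter sigma that produces a prescribed mean."""
    return mean / math.sqrt(math.pi / 2)

# Once the mean is fixed, the variance is fixed as well.
for target_mean in (1.0, 3.0):
    sigma = sigma_for_mean(target_mean)
    mean, variance = rayleigh_moments(sigma)
    print(f"mean = {mean:.4f}, variance = {variance:.4f}")
```

Running the loop for the two means used in Fig. A.2 should reproduce, up to rounding, the variances quoted for the top and bottom plots.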

These remarks on the variance of a random variable can be further qualified by invoking a well-known
result from probability theory known as Chebyshev's inequality (see Probs. I.2 and I.3).
The result states that for a random variable $\mathbf{x}$ with mean $\bar{x}$ and variance $\sigma_x^2$, and for any given scalar
$\delta > 0$, it holds that


$$P(|\mathbf{x} - \bar{x}| \ge \delta) \le \sigma_x^2/\delta^2 \qquad \text{(A.5)}$$


That is, the probability that $\mathbf{x}$ assumes values outside the interval $(\bar{x} - \delta, \bar{x} + \delta)$ does not exceed
$\sigma_x^2/\delta^2$, with the bound being proportional to the variance of $\mathbf{x}$. Hence, for a fixed $\delta$, the smaller the
variance of $\mathbf{x}$, the smaller the probability that $\mathbf{x}$ will assume values outside the interval $(\bar{x} - \delta, \bar{x} + \delta)$.
Choose, for instance, $\delta = 5\sigma_x$. Then (A.5) gives


$$P(|\mathbf{x} - \bar{x}| \ge 5\sigma_x) \le 1/25 = 4\%$$

FIGURE A.2 Two Rayleigh distributions. The figure shows the plots of the probability density functions of a Rayleigh random variable $\mathbf{x}$
with mean $\bar{x} = 1$ and variance $\sigma_x^2 = 0.2732$ in the top plot, and mean $\bar{x} = 3$ and variance $\sigma_x^2 = 2.4592$ in
the bottom plot.


In other words, there is at most a 4% chance that $\mathbf{x}$ will assume values outside the interval
$(\bar{x} - 5\sigma_x, \bar{x} + 5\sigma_x)$.

Actually, the bound that is provided by Chebyshev's inequality is generally not tight. Consider,
for example, a zero-mean Gaussian random variable $\mathbf{x}$ with variance $\sigma_x^2$ and choose $\delta = 2\sigma_x$. Then,
from Chebyshev's inequality (A.5) we would obtain


$$P(|\mathbf{x}| \ge 2\sigma_x) \le 1/4 = 25\%$$


whereas direct evaluation of the integral

$$P(|\mathbf{x}| \ge 2\sigma_x) = 1 - \int_{-2\sigma_x}^{2\sigma_x} \frac{1}{\sqrt{2\pi}\,\sigma_x}\, e^{-x^2/2\sigma_x^2}\,dx$$

yields

$$P(|\mathbf{x}| \ge 2\sigma_x) \approx 4.55\%$$
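
The gap between the Chebyshev bound and the exact Gaussian tail probability can be evaluated directly. The sketch below is illustrative only: it assumes a Gaussian random variable and uses the complementary error function `math.erfc` from the Python standard library, through the identity $P(|\mathbf{x} - \bar{x}| \ge k\sigma_x) = \mathrm{erfc}(k/\sqrt{2})$; the helper names are ours.

```python
import math

def chebyshev_bound(k):
    """Chebyshev bound on P(|x - mean| >= k * std); holds for any distribution."""
    return 1.0 / k ** 2

def gaussian_tail(k):
    """Exact P(|x - mean| >= k * std) when x is Gaussian."""
    return math.erfc(k / math.sqrt(2))

for k in (2, 5):
    print(f"k = {k}: Chebyshev bound = {chebyshev_bound(k):.4f}, "
          f"Gaussian tail = {gaussian_tail(k):.3e}")
```

For $k = 2$ the bound is 25% while the Gaussian tail is below 5%, and the gap widens quickly as $k$ grows.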


Remark A.1 (Zero-variance random variables) One useful consequence of Chebyshev's inequality is
the following. It allows us to interpret a zero-variance random variable as one that is equal to its mean with
probability one. That is,

$$\sigma_x^2 = 0 \;\Longrightarrow\; \mathbf{x} = \bar{x} \ \text{with probability one}$$


This is because, for any small $\delta > 0$, we obtain from (A.5) that

$$P(|\mathbf{x} - \bar{x}| \ge \delta) \le 0$$


But since the probability of any event is necessarily a nonnegative number, we conclude that $P(|\mathbf{x} - \bar{x}| \ge \delta) = 0$,
for any $\delta > 0$, so that $\mathbf{x} = \bar{x}$ with probability one. We shall call upon this result on several occasions (see, e.g.,
the proof of Thm. 1.2).


◇


A.2 DEPENDENT RANDOM VARIABLES 


In estimation theory, it is generally the case that information about one variable is extracted from
observations of another variable. The relevance of the information extracted from the observations is a
function of how closely related the two random variables are, as measured by relations of dependency
or correlation between them.


The dependency between two real-valued random variables $\{\mathbf{x}, \mathbf{y}\}$ is characterized by their joint
probability density function (pdf). Thus, let $f_{\mathbf{x},\mathbf{y}}(x, y)$ denote the joint (continuous) pdf of $\mathbf{x}$ and $\mathbf{y}$;
this function allows us to evaluate probabilities of events of the form:


$$P(a \le \mathbf{x} \le b,\ c \le \mathbf{y} \le d) = \int_c^d \int_a^b f_{\mathbf{x},\mathbf{y}}(x, y)\,dx\,dy$$


namely, the probability that $\mathbf{x}$ and $\mathbf{y}$ assume values inside the intervals $[a, b]$ and $[c, d]$, respectively.
Let also $f_{\mathbf{x}|\mathbf{y}}(x|y)$ denote the conditional pdf of $\mathbf{x}$ given $\mathbf{y}$; this function allows us to evaluate
probabilities of events of the form


$$P(a \le \mathbf{x} \le b \mid \mathbf{y} = y) = \int_a^b f_{\mathbf{x}|\mathbf{y}}(x|y)\,dx$$


namely, the probability that $\mathbf{x}$ assumes values inside the interval $[a, b]$ given that $\mathbf{y}$ is fixed at the
value $y$. It is known that the joint and conditional pdfs of two random variables are related via
Bayes' rule, which states that


$$f_{\mathbf{x},\mathbf{y}}(x, y) = f_{\mathbf{y}}(y)\, f_{\mathbf{x}|\mathbf{y}}(x|y) = f_{\mathbf{x}}(x)\, f_{\mathbf{y}|\mathbf{x}}(y|x) \qquad \text{(A.6)}$$


in terms of the probability density functions of the individual random variables $\mathbf{x}$ and $\mathbf{y}$.


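The variables $\{\mathbf{x}, \mathbf{y}\}$ are said to be independent if

$$f_{\mathbf{x}|\mathbf{y}}(x|y) = f_{\mathbf{x}}(x) \quad \text{and} \quad f_{\mathbf{y}|\mathbf{x}}(y|x) = f_{\mathbf{y}}(y)$$


in which case the pdfs of $\mathbf{x}$ and $\mathbf{y}$ are not modified by conditioning on $\mathbf{y}$ and $\mathbf{x}$, respectively.
Otherwise, the variables are said to be dependent. In particular, when the variables are independent, it
follows that $\mathrm{E}\,\mathbf{x}\mathbf{y} = \mathrm{E}\,\mathbf{x}\,\mathrm{E}\,\mathbf{y}$. It also follows that independent random variables are uncorrelated,
meaning that their cross-correlation is zero as can be verified from the definition of cross-correlation:


$$\begin{aligned}
\sigma_{xy} &\triangleq \mathrm{E}\,(\mathbf{x} - \bar{x})(\mathbf{y} - \bar{y}) \\
&= \mathrm{E}\,\mathbf{x}\mathbf{y} - \bar{x}\bar{y} \\
&= \mathrm{E}\,\mathbf{x}\,\mathrm{E}\,\mathbf{y} - \bar{x}\bar{y} \\
&= 0
\end{aligned}$$


The converse statement is not true: uncorrelated random variables can be dependent. Consider the
following example. Let $\boldsymbol{\theta}$ be a random variable that is uniformly distributed over the interval $[0, 2\pi]$.
Define the zero-mean random variables $\mathbf{x} = \cos\boldsymbol{\theta}$ and $\mathbf{y} = \sin\boldsymbol{\theta}$. Then $\mathbf{x}^2 + \mathbf{y}^2 = 1$ so that $\mathbf{x}$
and $\mathbf{y}$ are dependent. However, $\mathrm{E}\,\mathbf{x}\mathbf{y} = \mathrm{E}\cos\boldsymbol{\theta}\sin\boldsymbol{\theta} = 0.5\,\mathrm{E}\sin 2\boldsymbol{\theta} = 0$, so that $\mathbf{x}$ and $\mathbf{y}$ are
uncorrelated.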
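
The cosine and sine example can be reproduced numerically. The sketch below is a hedged illustration: it replaces the expectation over a uniformly distributed angle by an average over an evenly spaced grid on $[0, 2\pi)$ (our choice, not the text's), and it exposes the dependence by comparing $\mathrm{E}\,\mathbf{x}^2\mathbf{y}^2$ with $\mathrm{E}\,\mathbf{x}^2\,\mathrm{E}\,\mathbf{y}^2$, which would coincide for independent variables.

```python
import math

N = 10_000  # number of grid points standing in for the uniform angle
thetas = [2 * math.pi * k / N for k in range(N)]
xs = [math.cos(t) for t in thetas]
ys = [math.sin(t) for t in thetas]

mean_x = sum(xs) / N
mean_y = sum(ys) / N
cross = sum(x * y for x, y in zip(xs, ys)) / N           # E xy

e_x2 = sum(x * x for x in xs) / N                         # E x^2
e_y2 = sum(y * y for y in ys) / N                         # E y^2
e_x2y2 = sum((x * y) ** 2 for x, y in zip(xs, ys)) / N    # E x^2 y^2

print(f"E x = {mean_x:.2e}, E y = {mean_y:.2e}, E xy = {cross:.2e}")
print(f"E x^2 y^2 = {e_x2y2:.4f}, E x^2 E y^2 = {e_x2 * e_y2:.4f}")
```

The cross-correlation vanishes, yet the two fourth-order quantities differ, which is only possible when the variables are dependent.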


A.3 COMPLEX-VALUED RANDOM VARIABLES 


Although we have focused so far on real-valued random variables, we shall often encounter applications
that deal with random variables that assume complex values. Accordingly, a complex-valued
random variable is defined as one whose real and imaginary parts are real-valued random variables,
say,


$$\mathbf{x} = \mathbf{x}_r + j\mathbf{x}_i, \qquad j \triangleq \sqrt{-1}$$


where $\mathbf{x}_r$ and $\mathbf{x}_i$ denote the real and imaginary parts of $\mathbf{x}$. Therefore, the pdf of a complex-valued
random variable $\mathbf{x}$ can be characterized in terms of the joint pdf, $f_{\mathbf{x}_r,\mathbf{x}_i}(\cdot,\cdot)$, of its real and imaginary
parts. This means that we can regard a complex random variable as a function of two real random
variables. The mean of $\mathbf{x}$ is

$$\mathrm{E}\,\mathbf{x} \triangleq \mathrm{E}\,\mathbf{x}_r + j\,\mathrm{E}\,\mathbf{x}_i = \bar{x}_r + j\bar{x}_i$$


in terms of the means of its real and imaginary parts. The variance of $\mathbf{x}$, on the other hand, is defined
as


$$\sigma_x^2 \triangleq \mathrm{E}\,(\mathbf{x} - \bar{x})(\mathbf{x} - \bar{x})^* = \mathrm{E}\,|\mathbf{x}|^2 - |\bar{x}|^2 \qquad \text{(A.7)}$$


where the symbol $*$ denotes complex conjugation.

Comparing with the definition (A.1) in the real case, we see that the above definition is different
because of the use of conjugation (in the real case, the conjugate of $(\mathbf{x} - \bar{x})$ is $(\mathbf{x} - \bar{x})$ itself and
the above definition reduces to (A.1)). The use of the conjugate term in (A.7) is necessary in order
to guarantee that $\sigma_x^2$ will be a nonnegative real number. In particular, it is immediate to verify from
(A.7) that

$$\sigma_x^2 = \sigma_{x_r}^2 + \sigma_{x_i}^2$$


in terms of the individual variances of $\mathbf{x}_r$ and $\mathbf{x}_i$.

We shall say that two complex-valued random variables $\mathbf{x}$ and $\mathbf{y}$ are uncorrelated if, and only if,
their cross-correlation is zero, i.e.,


$$\sigma_{xy} \triangleq \mathrm{E}\,(\mathbf{x} - \bar{x})(\mathbf{y} - \bar{y})^* = 0$$

On the other hand, we shall say that they are orthogonal if, and only if,

$$\mathrm{E}\,\mathbf{x}\mathbf{y}^* = 0$$


It can be immediately verified that the concepts of orthogonality and uncorrelatedness coincide if at
least one of the random variables has zero mean.


Example A.1 (QPSK constellation)
Consider a signal $\mathbf{x}$ that is chosen uniformly from a QPSK constellation, i.e., $\mathbf{x}$ assumes any of the
values $\pm\sqrt{2}/2 \pm j\sqrt{2}/2$ with probability 1/4 (see Fig. A.3). Clearly, $\mathbf{x}$ is a complex-valued random
variable; its mean and variance are easily found to be $\bar{x} = 0$ and $\sigma_x^2 = 1$.


FIGURE A.3 A QPSK constellation. (Axes: real part, imaginary part.)
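
The mean and variance quoted in the QPSK example can be verified by direct enumeration of the four equally likely points. The following sketch uses Python's built-in complex numbers; the variable names are illustrative.

```python
import math

a = math.sqrt(2) / 2
# The four QPSK points (+-a) + j(+-a), each with probability 1/4.
constellation = [complex(sr * a, si * a) for sr in (1, -1) for si in (1, -1)]
prob = 1 / len(constellation)

mean = sum(prob * s for s in constellation)
# Variance of a complex variable: E |x - mean|^2 (note the magnitude).
variance = sum(prob * abs(s - mean) ** 2 for s in constellation)

# Variances of the real and imaginary parts, which should add up to the total.
var_real = sum(prob * (s.real - mean.real) ** 2 for s in constellation)
var_imag = sum(prob * (s.imag - mean.imag) ** 2 for s in constellation)

print(mean, variance, var_real + var_imag)
```

The last two printed values agree, which illustrates that the variance of a complex variable is the sum of the variances of its real and imaginary parts.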
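A.4 VECTOR-VALUED RANDOM VARIABLES


A vector-valued random variable is a collection (in column or row vector forms) of random variables.
The individual entries can be real or complex-valued themselves. For example, if $\mathbf{x} = \mathrm{col}\{\mathbf{x}(0), \mathbf{x}(1)\}$
is a random vector with entries $\mathbf{x}(0)$ and $\mathbf{x}(1)$, then we shall define its mean as
the vector of individual means,


$$\bar{x} \triangleq \mathrm{E}\,\mathbf{x} = \begin{bmatrix} \bar{x}(0) \\ \bar{x}(1) \end{bmatrix} = \begin{bmatrix} \mathrm{E}\,\mathbf{x}(0) \\ \mathrm{E}\,\mathbf{x}(1) \end{bmatrix}$$

and its covariance matrix as

$$R_x \triangleq \mathrm{E}\,(\mathbf{x} - \bar{x})(\mathbf{x} - \bar{x})^* \qquad \text{(A.8)}$$

where the symbol $*$ now denotes complex-conjugate transposition (i.e., we transpose the vector and
then replace each of its entries by the corresponding conjugate value). Note that we are using parentheses
to index the scalar entries of a vector, e.g., $\mathbf{x}(k)$ denotes the $k$th entry of $\mathbf{x}$. Moreover, if $\mathbf{x}$
were a row random vector, rather than a column random vector, then its covariance matrix would
instead be defined as

$$R_x \triangleq \mathrm{E}\,(\mathbf{x} - \bar{x})^*(\mathbf{x} - \bar{x})$$

with the conjugate term coming first. This is because, in this case, it is the product $(\mathbf{x} - \bar{x})^*(\mathbf{x} - \bar{x})$
that yields a matrix, while the product $(\mathbf{x} - \bar{x})(\mathbf{x} - \bar{x})^*$ would be a scalar.
For the two-element column vector $\mathbf{x} = \mathrm{col}\{\mathbf{x}(0), \mathbf{x}(1)\}$ we obtain


$$R_x = \begin{bmatrix} \mathrm{E}\,|\mathbf{x}(0) - \bar{x}(0)|^2 & \mathrm{E}\,[\mathbf{x}(0) - \bar{x}(0)][\mathbf{x}(1) - \bar{x}(1)]^* \\ \mathrm{E}\,[\mathbf{x}(1) - \bar{x}(1)][\mathbf{x}(0) - \bar{x}(0)]^* & \mathrm{E}\,|\mathbf{x}(1) - \bar{x}(1)|^2 \end{bmatrix}$$

with the individual variances of the variables $\{\mathbf{x}(0), \mathbf{x}(1)\}$ appearing on the diagonal and the
cross-correlations between them appearing on the off-diagonal entries. In the zero-mean case, the definition
of $R_x$, and the above expression, simplify to


$$R_x = \mathrm{E}\,\mathbf{x}\mathbf{x}^* \quad \text{and} \quad R_x = \begin{bmatrix} \mathrm{E}\,|\mathbf{x}(0)|^2 & \mathrm{E}\,\mathbf{x}(0)\mathbf{x}^*(1) \\ \mathrm{E}\,\mathbf{x}(1)\mathbf{x}^*(0) & \mathrm{E}\,|\mathbf{x}(1)|^2 \end{bmatrix}$$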

It should be noted that the covariance matrix $R_x$ is Hermitian, i.e., it satisfies

$$R_x = R_x^*$$

Moreover, $R_x$ is a nonnegative-definite matrix, written as

$$R_x \ge 0$$


By definition, a Hermitian matrix $R$ is said to be nonnegative definite if, and only if, $a^* R a \ge 0$
for any column vector $a$ (real or complex-valued). In order to verify that $R_x \ge 0$, we introduce the
scalar-valued random variable $\mathbf{y} = a^*(\mathbf{x} - \bar{x})$, where $a$ is an arbitrary column vector. Then $\mathbf{y}$ has
zero mean and

$$\sigma_y^2 = \mathrm{E}\,|\mathbf{y}|^2 = a^* R_x a$$

But since the variance of any scalar-valued random variable is always nonnegative, we conclude that
$a^* R_x a \ge 0$ for any $a$. This means that $R_x$ is nonnegative definite, as claimed.


For real-valued data, the symbol $*$ is replaced by the transposition symbol $\mathsf{T}$, and $R_x$ is defined
as

$$R_x \triangleq \mathrm{E}\,(\mathbf{x} - \bar{x})(\mathbf{x} - \bar{x})^{\mathsf{T}}$$


In this case, the covariance matrix $R_x$ becomes symmetric rather than Hermitian. In other
words, it now satisfies

$$R_x = R_x^{\mathsf{T}}$$


Moreover, $R_x$ continues to be nonnegative-definite, which now means that $a^{\mathsf{T}} R_x a \ge 0$ for any
real-valued column vector $a$.


Example A.2 (Transmissions over a noisy channel)


Consider the setting of Fig. A.4. A sequence of independent and identically distributed (i.i.d.) symbols
$\{\mathbf{s}(i)\}$ is transmitted over an initially relaxed FIR channel with transfer function
$C(z) = 1 + 0.5z^{-1}$, where $z^{-1}$ denotes the unit-time delay in the z-transform domain. Each symbol is
either $+1$ with probability $p$ or $-1$ with probability $1 - p$. The output of the channel is corrupted
by zero-mean additive white Gaussian noise $\mathbf{v}(i)$ of unit variance. The whiteness assumption means
that $\mathrm{E}\,\mathbf{v}(i)\mathbf{v}^*(j) = \delta_{ij}$, where $\delta_{ij}$ denotes the Kronecker delta function that is equal to unity when
$i = j$ and zero otherwise. In other words, the noise terms are uncorrelated with each other. The
noise $\mathbf{v}(i)$ and the symbols $\mathbf{s}(j)$ are also assumed to be independent of each other for all $i$ and $j$.


[Block diagram: $\mathbf{s}(i)$ passes through the channel $1 + 0.5z^{-1}$ and the noise $\mathbf{v}(i)$ is added to produce $\mathbf{y}(i)$.]

FIGURE A.4 Transmissions over an additive white Gaussian FIR channel.


The output of the channel at time $i$ is given by $\mathbf{y}(i) = \mathbf{s}(i) + 0.5\,\mathbf{s}(i - 1) + \mathbf{v}(i)$. Assume we
collect $N + 1$ measurements, $\{\mathbf{y}(i), i = 0, 1, \ldots, N\}$, into a column vector $\mathbf{y}$, and then pose the
problem of recovering the transmitted symbols $\{\mathbf{s}(i), i = 0, 1, \ldots, N\}$ over the same interval of
time. If we collect the symbols $\{\mathbf{s}(i)\}$ into a column vector $\mathbf{x}$, say $\mathbf{x} = \mathrm{col}\{\mathbf{s}(0), \mathbf{s}(1), \ldots, \mathbf{s}(N)\}$,
then we are faced with the problem of estimating a vector $\mathbf{x}$ from a vector $\mathbf{y}$. Here the entries of $\mathbf{x}$
and $\mathbf{y}$ are all real-valued. If the symbols $\mathbf{s}(i)$ were instead chosen from a QPSK constellation, then
both $\mathbf{x}$ and $\mathbf{y}$ would be complex-valued. ◇
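
The channel model of this example is easy to simulate. The sketch below is one possible way to generate the vectors of symbols and measurements; the function name `simulate_channel`, the `noise_std` and `seed` parameters, and the use of Python's `random` module are our own assumptions for illustration, and the initially relaxed channel is modeled by taking $\mathbf{s}(-1) = 0$.

```python
import random

def simulate_channel(num_symbols, p=0.5, noise_std=1.0, seed=0):
    """Generate symbols s(i) in {+1, -1} and outputs y(i) = s(i) + 0.5 s(i-1) + v(i)."""
    rng = random.Random(seed)
    # i.i.d. symbols: +1 with probability p, -1 with probability 1 - p
    s = [1.0 if rng.random() < p else -1.0 for _ in range(num_symbols)]
    y = []
    prev = 0.0  # initially relaxed channel: s(-1) = 0
    for sym in s:
        v = noise_std * rng.gauss(0.0, 1.0)  # white Gaussian noise sample
        y.append(sym + 0.5 * prev + v)
        prev = sym
    return s, y

x, y = simulate_channel(num_symbols=8, p=0.5)
print(x)
print([round(val, 3) for val in y])
```

Setting `noise_std=0.0` removes the noise and makes it easy to check the convolution with the two channel taps by hand.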


A.5 GAUSSIAN RANDOM VECTORS 


Gaussian random variables play an important role in many situations, especially when we deal with
the sum of a large number of random variables. In this case, a fundamental result in probability theory,
known as the central limit theorem, states that under conditions often reasonable in applications,
the probability density function (pdf) of the sum of independent random variables approaches that of
a Gaussian distribution. Specifically, if $\{\mathbf{x}(i), i = 1, 2, \ldots, N\}$ are independent real-valued Gaussian
random variables with means $\{\bar{x}(i)\}$ and variances $\{\sigma_x^2(i)\}$ each, then the pdf of the normalized
variable


$$\mathbf{y} \triangleq \frac{\sum_{i=1}^{N} [\mathbf{x}(i) - \bar{x}(i)]}{\sqrt{\sum_{i=1}^{N} \sigma_x^2(i)}}$$

approaches that of a Gaussian distribution with zero mean and unit variance, i.e.,

$$f_{\mathbf{y}}(y) = \frac{1}{\sqrt{2\pi}}\, e^{-y^2/2} \quad \text{as } N \to \infty$$

or, equivalently,

$$\lim_{N \to \infty} P(\mathbf{y} \le a) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{a} e^{-y^2/2}\,dy$$


It is for this reason that the term "Gaussian noise" generally refers to the combined effect of many
independent disturbances.

In this section we describe the general form of the pdf of a vector Gaussian random variable. 
However, as the discussion will show, we need to distinguish between two cases depending on 
whether the random variable is real or complex. In the complex case, the random variable will 
need to satisfy a certain circularity assumption in order for the given form of the pdf to be valid. 


Real-Valued Gaussian Random Variables 


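We start with the real case. Thus, consider a $p \times 1$ random vector $\mathbf{x}$ with mean $\bar{x}$ and a nonsingular
covariance matrix

$$R_x = \mathrm{E}\,(\mathbf{x} - \bar{x})(\mathbf{x} - \bar{x})^{\mathsf{T}}$$


We say that $\mathbf{x}$ has a Gaussian distribution if its pdf has the form


$$f_{\mathbf{x}}(x) = \frac{1}{\sqrt{(2\pi)^p}}\, \frac{1}{\sqrt{\det R_x}}\, \exp\left\{ -\tfrac{1}{2} (x - \bar{x})^{\mathsf{T}} R_x^{-1} (x - \bar{x}) \right\} \qquad \text{(A.9)}$$


in terms of the determinant of $R_x$. Of course, when $p = 1$, the above expression reduces to the pdf
considered in the text in (A.2) with $R_x$ replaced by $\sigma_x^2$.
Now consider a second $q \times 1$ Gaussian random vector $\mathbf{y}$ with mean $\bar{y}$ and covariance matrix


$$R_y = \mathrm{E}\,(\mathbf{y} - \bar{y})(\mathbf{y} - \bar{y})^{\mathsf{T}}$$

so that its pdf is given by


$$f_{\mathbf{y}}(y) = \frac{1}{\sqrt{(2\pi)^q}}\, \frac{1}{\sqrt{\det R_y}}\, \exp\left\{ -\tfrac{1}{2} (y - \bar{y})^{\mathsf{T}} R_y^{-1} (y - \bar{y}) \right\}$$


Let $R_{xy}$ denote the cross-covariance matrix between $\mathbf{x}$ and $\mathbf{y}$, i.e.,


$$R_{xy} = \mathrm{E}\,(\mathbf{x} - \bar{x})(\mathbf{y} - \bar{y})^{\mathsf{T}}$$


We then say that the random variables $\{\mathbf{x}, \mathbf{y}\}$ have a joint Gaussian distribution if their joint pdf has
the form


$$f_{\mathbf{x},\mathbf{y}}(x, y) = \frac{1}{\sqrt{(2\pi)^{p+q}}}\, \frac{1}{\sqrt{\det R}}\, \exp\left\{ -\tfrac{1}{2} \begin{bmatrix} (x - \bar{x})^{\mathsf{T}} & (y - \bar{y})^{\mathsf{T}} \end{bmatrix} R^{-1} \begin{bmatrix} x - \bar{x} \\ y - \bar{y} \end{bmatrix} \right\} \qquad \text{(A.10)}$$


in terms of the covariance matrix $R$ of $\mathrm{col}\{\mathbf{x}, \mathbf{y}\}$, namely,


$$R \triangleq \mathrm{E} \left( \begin{bmatrix} \mathbf{x} \\ \mathbf{y} \end{bmatrix} - \begin{bmatrix} \bar{x} \\ \bar{y} \end{bmatrix} \right) \left( \begin{bmatrix} \mathbf{x} \\ \mathbf{y} \end{bmatrix} - \begin{bmatrix} \bar{x} \\ \bar{y} \end{bmatrix} \right)^{\mathsf{T}} = \begin{bmatrix} R_x & R_{xy} \\ R_{xy}^{\mathsf{T}} & R_y \end{bmatrix}$$


It can be seen from (A.10) that the joint pdf of $\{\mathbf{x}, \mathbf{y}\}$ is completely determined by the mean, covariances,
and cross-covariance of $\{\mathbf{x}, \mathbf{y}\}$, i.e., by the first- and second-order moments $\{\bar{x}, \bar{y}, R_x, R_y, R_{xy}\}$.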
Complex-Valued Random Variables and Circularity


Let us now examine the case of complex-valued random vectors. We consider again two real random
vectors $\{\mathbf{x}, \mathbf{y}\}$, both assumed of size $p \times 1$, with joint pdf given by (cf. (A.10)):


$$f_{\mathbf{x},\mathbf{y}}(x, y) = \frac{1}{(2\pi)^p}\, \frac{1}{\sqrt{\det R}}\, \exp\left\{ -\tfrac{1}{2} \begin{bmatrix} (x - \bar{x})^{\mathsf{T}} & (y - \bar{y})^{\mathsf{T}} \end{bmatrix} R^{-1} \begin{bmatrix} x - \bar{x} \\ y - \bar{y} \end{bmatrix} \right\} \qquad \text{(A.11)}$$


Let $\mathbf{z} = \mathbf{x} + j\mathbf{y}$, where $j = \sqrt{-1}$, denote a complex-valued random variable defined in terms of
$\{\mathbf{x}, \mathbf{y}\}$. Its mean is simply


$$\bar{z} = \mathrm{E}\,\mathbf{z} = \bar{x} + j\bar{y}$$


while its covariance matrix is

$$R_z \triangleq \mathrm{E}\,(\mathbf{z} - \bar{z})(\mathbf{z} - \bar{z})^* = (R_x + R_y) + j(R_{yx} - R_{xy}) \qquad \text{(A.12)}$$


which is expressed in terms of the covariances and cross-covariance of $\{\mathbf{x}, \mathbf{y}\}$.

We shall say that the complex variable $\mathbf{z}$ has a Gaussian distribution if its real and imaginary parts
$\{\mathbf{x}, \mathbf{y}\}$ are jointly Gaussian. Since $\mathbf{z}$ is a function of $\{\mathbf{x}, \mathbf{y}\}$, its pdf is characterized by the joint pdf
of $\{\mathbf{x}, \mathbf{y}\}$ as in (A.11), i.e., in terms of $\{\bar{x}, \bar{y}, R_x, R_y, R_{xy}\}$. However, we would like to express
the pdf of $\mathbf{z}$ in terms of its own first- and second-order moments, i.e., in terms of $\{\bar{z}, R_z\}$. It turns
out that this is not always possible. This is because knowledge of $\{\bar{z}, R_z\}$ alone is not enough to
recover the moments $\{\bar{x}, \bar{y}, R_x, R_y, R_{xy}\}$. More information is needed in the form of a circularity
condition.

To see this, assume we only know $\{\bar{z}, R_z\}$. Then this information is enough to recover $\{\bar{x}, \bar{y}\}$
since $\bar{z} = \bar{x} + j\bar{y}$. However, the information is not enough to recover $\{R_x, R_y, R_{xy}\}$. This is
because, as we see from (A.12), knowledge of $R_z$ allows us to recover the values of $(R_x + R_y)$ and
$(R_{yx} - R_{xy})$ via

$$R_x + R_y = \mathrm{Re}(R_z), \qquad R_{yx} - R_{xy} = \mathrm{Im}(R_z) \qquad \text{(A.13)}$$

in terms of the real and imaginary parts of $R_z$. This information is not sufficient to determine the
individual covariances $\{R_x, R_y, R_{xy}\}$.

In order to be able to uniquely recover $\{R_x, R_y, R_{xy}\}$ from $R_z$, it is generally further assumed
that the random variable $\mathbf{z}$ satisfies a circularity condition. This means that $\mathbf{z}$ should satisfy


$$\mathrm{E}\,(\mathbf{z} - \bar{z})(\mathbf{z} - \bar{z})^{\mathsf{T}} = 0 \qquad \text{(circularity condition)}$$


with the transposition symbol $\mathsf{T}$ used instead of Hermitian conjugation. Knowledge of $R_z$ and this
circularity condition are enough to recover $\{R_x, R_y, R_{xy}\}$ from $R_z$. Indeed, using the fact that


$$\mathrm{E}\,(\mathbf{z} - \bar{z})(\mathbf{z} - \bar{z})^{\mathsf{T}} = (R_x - R_y) + j(R_{yx} + R_{xy})$$


we find that, in view of the circularity assumption, it must hold that $R_x = R_y$ and $R_{xy} = -R_{yx}$.
Consequently, combining with (A.13), we can solve for $\{R_x, R_y, R_{xy}\}$ to get

$$R_x = R_y = \tfrac{1}{2}\,\mathrm{Re}(R_z) \quad \text{and} \quad R_{xy} = -R_{yx} = -\tfrac{1}{2}\,\mathrm{Im}(R_z) \qquad \text{(A.14)}$$


It follows that the covariance matrix of $\mathrm{col}\{\mathbf{x}, \mathbf{y}\}$ can be expressed in terms of $R_z$ as


$$R = \frac{1}{2} \begin{bmatrix} \mathrm{Re}(R_z) & -\mathrm{Im}(R_z) \\ \mathrm{Im}(R_z) & \mathrm{Re}(R_z) \end{bmatrix}$$

Actually, it also follows that $R$ should have the symmetry structure

$$R = \begin{bmatrix} R_x & R_{xy} \\ -R_{xy} & R_x \end{bmatrix} \qquad \text{(A.15)}$$


with the same matrix $R_x$ appearing on the diagonal, and with $R_{xy}$ and its negative appearing at the
off-diagonal locations. Observe further that when $\mathbf{z}$ happens to be scalar-valued, then $R_{xy}$ becomes
a scalar, say, $\sigma_{xy}$, and the condition $R_{xy} = -R_{yx}$ can only hold if $\sigma_{xy} = 0$. That is, the real and
imaginary parts of $\mathbf{z}$ will need to be independent in the scalar case.
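
The recovery of the real covariances from $R_z$ under circularity can be illustrated with a small numerical sketch. The $2 \times 2$ matrices below are hypothetical values chosen for illustration; the code builds $R_z = (R_x + R_y) + j(R_{yx} - R_{xy})$ with $R_y = R_x$ and $R_{yx} = -R_{xy}$, and then undoes the construction through $R_x = \tfrac{1}{2}\mathrm{Re}(R_z)$ and $R_{xy} = -\tfrac{1}{2}\mathrm{Im}(R_z)$. The helper name `recover_real_covariances` is ours.

```python
def recover_real_covariances(Rz):
    """Under circularity: Rx = Ry = Re(Rz)/2 and Rxy = -Ryx = -Im(Rz)/2."""
    Rx = [[entry.real / 2 for entry in row] for row in Rz]
    Rxy = [[-entry.imag / 2 for entry in row] for row in Rz]
    return Rx, Rxy

# Hypothetical covariances of the real and imaginary parts (illustrative values).
Rx = [[2.0, 0.5], [0.5, 1.0]]      # symmetric; circularity forces Ry = Rx
Rxy = [[0.0, 0.3], [-0.3, 0.0]]    # skew-symmetric, since Rxy = -Ryx = -Rxy^T

# Rz = (Rx + Ry) + j(Ryx - Rxy) = 2 Rx - 2j Rxy under circularity.
Rz = [[complex(2 * Rx[i][k], -2 * Rxy[i][k]) for k in range(2)] for i in range(2)]

Rx_rec, Rxy_rec = recover_real_covariances(Rz)
print(Rz)
print(Rx_rec, Rxy_rec)
```

Note that the diagonal of `Rxy` is zero, in agreement with the remark that the real and imaginary parts of each scalar entry must be uncorrelated.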


Using the result (A.15), we can now verify that the joint pdf of $\{\mathbf{x}, \mathbf{y}\}$ in (A.11) can be rewritten
in terms of $\{\bar{z}, R_z\}$ as shown below (compare with (A.9) in the real case). Observe in particular
that the factors of 2, as well as the square-roots, disappear from the pdf expression in the complex
case.


FIGURE A.5 Probability density function of a circular Gaussian random variable. A typical plot of the probability density function of a zero-mean scalar and circular Gaussian
random variable.


Lemma A.1 (Circular Gaussian random variables) The pdf of a complex-valued
circular (or spherically invariant) Gaussian random variable $\mathbf{z}$ of dimension $p$ is
given by


$$f_{\mathbf{z}}(z) = \frac{1}{\pi^p}\, \frac{1}{\det R_z}\, \exp\left\{ -(z - \bar{z})^* R_z^{-1} (z - \bar{z}) \right\} \qquad \text{(A.16)}$$


Proof: Using (A.15) we have

$$\det R = \det(R_x) \cdot \det(R_x + R_{xy} R_x^{-1} R_{xy})$$

Likewise, using the expression $R_z = 2[R_x - jR_{xy}]$, we obtain


$$\begin{aligned}
[\det R_z]^2 &= \det(R_z) \cdot \det(R_z^{\mathsf{T}}) \\
&= 2^{2p} \det[R_x(I - jR_x^{-1} R_{xy})] \cdot \det(R_x - jR_{xy}^{\mathsf{T}})
\end{aligned}$$

But

$$R_{xy}^{\mathsf{T}} = R_{yx} = -R_{xy}$$

and, for matrices $A$ and $B$ of compatible dimensions, $\det(AB) = \det(BA)$. Hence,

$$\begin{aligned}
[\det R_z]^2 &= 2^{2p} \det R_x \cdot \det[(R_x + jR_{xy})(I - jR_x^{-1} R_{xy})] \\
&= 2^{2p} \det(R_x) \cdot \det(R_x + R_{xy} R_x^{-1} R_{xy})
\end{aligned}$$


We conclude that $\det R = 2^{-2p} (\det R_z)^2$. Finally, some algebra will show that the exponents in (A.11) and
(A.16) are identical.
◇


Figure A.5 plots the pdf of a scalar zero-mean complex-valued and circular Gaussian random
variable $\mathbf{z}$ using $R = I_2$, i.e., $\sigma_x^2 = \sigma_y^2 = 1$ and $\sigma_{xy} = 0$ so that $\sigma_z^2 = 2$. Therefore, in this example,
the real and imaginary parts of $\mathbf{z}$ are independent Gaussian random variables with identical variances.

When (A.16) holds, we can check that uncorrelated jointly Gaussian random variables will also
be independent; this is one of the main reasons for the assumption of circularity.


Two Fourth-Order Moment Results 


We establish below two useful results concerning the evaluation of fourth-order moments of Gaussian
random variables, in both cases of real and complex-valued data. Although these results will
only be used later in Parts IV (Mean-Square Performance) and V (Transient Performance) when we
study the performance of adaptive filters, we list them here because their proofs relate to the earlier
discussion on Gaussian random variables.


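Lemma A.2 (Fourth-moment of real Gaussian variables) Let $\mathbf{x}$ be a real-valued
Gaussian random column vector with zero mean and a diagonal covariance matrix,
say, $\mathrm{E}\,\mathbf{x}\mathbf{x}^{\mathsf{T}} = \Lambda$. Then for any symmetric matrix $W$ of compatible dimensions it
holds that


$$\mathrm{E}\,\{\mathbf{x}\mathbf{x}^{\mathsf{T}} W \mathbf{x}\mathbf{x}^{\mathsf{T}}\} = \Lambda\,\mathrm{Tr}(W\Lambda) + 2\Lambda W \Lambda \qquad \text{(A.17)}$$


Proof: The argument is based on the fact that uncorrelated Gaussian random variables are also independent,
so that if $\mathbf{x}(i)$ is the $i$th element of $\mathbf{x}$, then $\mathbf{x}(i)$ is independent of $\mathbf{x}(j)$ for $i \ne j$. Now let $S$ denote the desired
matrix, i.e., $S = \mathrm{E}\,\{\mathbf{x}\mathbf{x}^{\mathsf{T}} W \mathbf{x}\mathbf{x}^{\mathsf{T}}\}$, and let $S_{ij}$ denote its $(i, j)$th element. Assume also that $\mathbf{x}$ is $p$-dimensional.
Then


$$S_{ij} = \mathrm{E}\left\{ \mathbf{x}(i)\mathbf{x}(j) \left( \sum_{m=0}^{p-1} \sum_{n=0}^{p-1} \mathbf{x}(m) W_{mn} \mathbf{x}(n) \right) \right\}$$

The right-hand side is nonzero only when there are two pairs of equal indices $\{i = j, m = n\}$ or $\{i = m, j = n\}$
or $\{i = n, j = m\}$. Assume first that $i = j$ (which corresponds to the diagonal elements of $S$). Then the
expectation is nonzero only for $m = n$, i.e.,


$$S_{ii} = \mathrm{E}\left\{ \mathbf{x}^2(i) \sum_{m=0}^{p-1} W_{mm} \mathbf{x}^2(m) \right\} = \sum_{m=0}^{p-1} W_{mm}\, \mathrm{E}\,\{\mathbf{x}^2(i)\mathbf{x}^2(m)\} = \lambda_i\,\mathrm{Tr}(W\Lambda) + 2W_{ii}\lambda_i^2$$


where we used the fact that for a zero-mean real scalar-valued Gaussian random variable $\mathbf{a}$ we have
$\mathrm{E}\,\mathbf{a}^4 = 3(\mathrm{E}\,\mathbf{a}^2)^2 = 3\sigma_a^4$, where $\sigma_a^2$ denotes the variance of $\mathbf{a}$, $\sigma_a^2 = \mathrm{E}\,\mathbf{a}^2$. We are also denoting the diagonal entries of
$\Lambda$ by $\{\lambda_i\}$.

For the off-diagonal elements of $S$ (i.e., for $i \ne j$), we must have either $i = n, j = m$, or $i = m, j = n$, so
that


$$\begin{aligned}
S_{ij} &= \mathrm{E}\,\{\mathbf{x}(i)\mathbf{x}(j)\,(\mathbf{x}(i) W_{ij} \mathbf{x}(j))\} + \mathrm{E}\,\{\mathbf{x}(i)\mathbf{x}(j)\,(\mathbf{x}(j) W_{ji} \mathbf{x}(i))\} \\
&= (W_{ij} + W_{ji})\, \mathrm{E}\,\{\mathbf{x}^2(i)\mathbf{x}^2(j)\} = (W_{ij} + W_{ji})\,\lambda_i \lambda_j
\end{aligned}$$


Using the fact that $W$ is symmetric, so that $W_{ij} = W_{ji}$, and collecting the expressions for $S_{ij}$, in both cases of
$i = j$ and $i \ne j$, into matrix form we get the desired result (A.17). ◇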


The equivalent result for complex-valued circular Gaussian random variables is the following.
The only difference is that the factor of 2 in (A.17) disappears. This is because the fourth-order moment of a
zero-mean complex scalar-valued circular random variable $\mathbf{a}$, of variance $\sigma_a^2 = \mathrm{E}\,|\mathbf{a}|^2$, is now given by
$\mathrm{E}\,|\mathbf{a}|^4 = 2(\mathrm{E}\,|\mathbf{a}|^2)^2 = 2\sigma_a^4$ (see, e.g., App. 1.B of Sayed (2003)).


Lemma A.3 (Fourth-moment of complex Gaussian variables) Let $\mathbf{z}$ be a circular
complex-valued Gaussian random column vector with zero mean and a diagonal
covariance matrix, say, $\mathrm{E}\,\mathbf{z}\mathbf{z}^* = \Lambda$. Then for any Hermitian matrix $W$ of
compatible dimensions it holds that


$$\mathrm{E}\,\{\mathbf{z}\mathbf{z}^* W \mathbf{z}\mathbf{z}^*\} = \Lambda\,\mathrm{Tr}(W\Lambda) + \Lambda W \Lambda \qquad \text{(A.18)}$$
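
In the scalar case, the two fourth-moment lemmas reduce to the identities $\mathrm{E}\,\mathbf{a}^4 = 3\sigma_a^4$ (real) and $\mathrm{E}\,|\mathbf{a}|^4 = 2\sigma_a^4$ (circular complex). The sketch below checks both numerically. It is only an illustration under stated assumptions: moments are computed by a trapezoidal rule over a truncated range (the `half_width` and `steps` values are our choices), and the circular variable is modeled as $\mathbf{a} = \mathbf{x} + j\mathbf{y}$ with independent real and imaginary parts of variance $\sigma_a^2/2$ each.

```python
import math

def gaussian_moment(order, variance, half_width=12.0, steps=20000):
    """E x^order for a zero-mean real Gaussian x, by the trapezoidal rule."""
    std = math.sqrt(variance)
    lo, hi = -half_width * std, half_width * std
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        x = lo + k * h
        weight = 0.5 if k in (0, steps) else 1.0
        pdf = math.exp(-x * x / (2 * variance)) / math.sqrt(2 * math.pi * variance)
        total += weight * x ** order * pdf
    return total * h

sigma2 = 1.5  # illustrative variance

# Real case: E a^4 should equal 3 * sigma2^2.
real_fourth = gaussian_moment(4, sigma2)

# Circular complex case: |a|^2 = x^2 + y^2 with x, y independent, variance sigma2/2 each,
# so E |a|^4 = E x^4 + 2 E x^2 E y^2 + E y^4.
m2 = gaussian_moment(2, sigma2 / 2)
m4 = gaussian_moment(4, sigma2 / 2)
complex_fourth = 2 * m4 + 2 * m2 * m2

print(real_fourth, 3 * sigma2 ** 2)
print(complex_fourth, 2 * sigma2 ** 2)
```

The factor drops from 3 to 2 in the complex case because each of the real and imaginary parts carries only half of the total variance.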
Throughout the book, the reader will be exposed to a variety of concepts from linear algebra and
matrix theory in a motivated manner. In this way, after progressing sufficiently far into the book,
readers will be able to master many of the useful concepts described herein.


B.1 HERMITIAN AND POSITIVE-DEFINITE MATRICES 


Hermitian matrices. The Hermitian conjugate, A*, of a matrix A is the complex conjugate of its 
transpose, e.g., 


: 1 =j " 1 2-j . 
if A= then A* = , where j=v-1 
E A ii | 5 


A Hermitian matrix is a square matrix satisfying $A^* = A$, e.g.,


PC DEC UE S MEE PN 
1-3 1 1-3 1 


so that A is Hermitian. 


Spectral decomposition. Hermitian matrices can only have real eigenvalues. To see this, assume
$u_i$ is an eigenvector of $A$ corresponding to an eigenvalue $\lambda_i$, i.e., $Au_i = \lambda_i u_i$. Multiplying from the
left by $u_i^*$ we get $u_i^* A u_i = \lambda_i \|u_i\|^2$, where $\|\cdot\|$ denotes the Euclidean norm of its argument. Now
the scalar quantity on the left-hand side of this equality is real since it coincides with its complex
conjugate, namely $(u_i^* A u_i)^* = u_i^* A^* u_i = u_i^* A u_i$. Therefore, $\lambda_i$ must be real too.

Another important property of Hermitian matrices, whose proof requires a more involved argument,
is that such matrices always have a full set of orthonormal eigenvectors. That is, if $A$ is $n \times n$
Hermitian, then there will exist $n$ orthonormal eigenvectors $u_i$ satisfying


$$Au_i = \lambda_i u_i, \qquad \|u_i\|^2 = 1, \qquad u_i^* u_j = 0 \ \text{for } i \ne j$$


In compact matrix notation we can write this so-called spectral (or modal or eigen-) decomposition
of $A$ as

$$A = U \Lambda U^*$$

where $\Lambda = \mathrm{diag}\{\lambda_1, \lambda_2, \ldots, \lambda_n\}$, $U = \begin{bmatrix} u_1 & u_2 & \ldots & u_n \end{bmatrix}$, and $U$ satisfies


$$UU^* = U^*U = I$$


We say that $U$ is a unitary matrix. Here, the notation diag{a, b} denotes a diagonal matrix with
diagonal entries a and b.
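
For a $2 \times 2$ Hermitian matrix the spectral decomposition can be written in closed form, which makes the properties above easy to check. The sketch below is illustrative: the matrix `A` is a hypothetical example of ours, the helper names are not from the text, and the eigenvector formula $u \propto \mathrm{col}\{b, \lambda - a\}$ assumes a nonzero off-diagonal entry $b$.

```python
import math

def hermitian_eig_2x2(A):
    """Eigenvalues (ascending) and unit eigenvectors of a 2x2 Hermitian matrix with A[0][1] != 0."""
    a, b, d = A[0][0].real, A[0][1], A[1][1].real
    center = (a + d) / 2
    radius = math.sqrt(((a - d) / 2) ** 2 + abs(b) ** 2)
    eigenvalues = [center - radius, center + radius]   # always real
    eigenvectors = []
    for lam in eigenvalues:
        u = [b, lam - a]                                # solves (A - lam I) u = 0
        norm = math.sqrt(sum(abs(e) ** 2 for e in u))
        eigenvectors.append([e / norm for e in u])
    return eigenvalues, eigenvectors

def reconstruct(eigenvalues, eigenvectors):
    """Form U Lambda U^* as the sum of lam_k * u_k u_k^*."""
    n = len(eigenvalues)
    return [[sum(lam * u[i] * complex(u[j]).conjugate()
                 for lam, u in zip(eigenvalues, eigenvectors))
             for j in range(n)] for i in range(n)]

A = [[2 + 0j, 1 + 1j], [1 - 1j, 3 + 0j]]   # hypothetical Hermitian example
lams, vecs = hermitian_eig_2x2(A)
inner = sum(complex(p).conjugate() * q for p, q in zip(vecs[0], vecs[1]))
A_rec = reconstruct(lams, vecs)
print(lams, abs(inner))
```

For this example the eigenvalues come out real, the two eigenvectors are orthogonal, and `A_rec` matches `A` up to rounding.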
Lemma B.1 (Rayleigh-Ritz characterization of eigenvalues) If $A$ is an $n \times n$
Hermitian matrix, then it holds that for all vectors $x$,


$$\lambda_{\min} \|x\|^2 \le x^* A x \le \lambda_{\max} \|x\|^2$$


as well as


$$\lambda_{\min} = \min_{x \ne 0} \left( \frac{x^* A x}{x^* x} \right) = \min_{\|x\| = 1} x^* A x, \qquad \lambda_{\max} = \max_{x \ne 0} \left( \frac{x^* A x}{x^* x} \right) = \max_{\|x\| = 1} x^* A x$$


where $\{\lambda_{\min}, \lambda_{\max}\}$ denote the smallest and largest eigenvalues of $A$. The ratio
$x^* A x / x^* x$ is called the Rayleigh-Ritz ratio.


Proof: Let $A = U \Lambda U^*$, where $U$ is unitary and $\Lambda$ has real entries, and define $y = U^* x$ for any vector $x$. Then


$$x^* A x = x^* U \Lambda U^* x = y^* \Lambda y = \sum_{k=1}^{n} \lambda_k |y(k)|^2$$

with the $\{y(k)\}$ denoting the individual entries of $y$. Now since the squared terms $\{|y(k)|^2\}$ are nonnegative,
we get


$$\lambda_{\min} \sum_{k=1}^{n} |y(k)|^2 \le \sum_{k=1}^{n} \lambda_k |y(k)|^2 \le \lambda_{\max} \sum_{k=1}^{n} |y(k)|^2$$

or, equivalently, $\lambda_{\min} \|y\|^2 \le x^* A x \le \lambda_{\max} \|y\|^2$. Using the fact that $U$ is unitary and, hence,

$$\|y\|^2 = y^* y = x^* \underbrace{U U^*}_{= I} x = \|x\|^2$$


we conclude that $\lambda_{\min} \|x\|^2 \le x^* A x \le \lambda_{\max} \|x\|^2$. The lower and upper bounds are achieved when $x$ is
chosen as the eigenvector corresponding to $\lambda_{\min}$ or $\lambda_{\max}$, respectively.


◇
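
The Rayleigh-Ritz bounds can be probed numerically by evaluating the ratio $x^* A x / x^* x$ for many vectors. The sketch below is a hedged illustration with a hypothetical $2 \times 2$ Hermitian matrix whose eigenvalues, 1 and 4, follow from the closed-form expression for the $2 \times 2$ case; the random test vectors and the helper name `rayleigh_ratio` are our own choices.

```python
import random

A = [[2 + 0j, 1 + 1j], [1 - 1j, 3 + 0j]]   # hypothetical Hermitian matrix
lam_min, lam_max = 1.0, 4.0                 # its eigenvalues (closed form for 2x2)

def rayleigh_ratio(A, x):
    """Return x^* A x / x^* x, which is real for Hermitian A."""
    n = len(x)
    Ax = [sum(A[i][k] * x[k] for k in range(n)) for i in range(n)]
    numerator = sum(x[i].conjugate() * Ax[i] for i in range(n))
    denominator = sum(abs(e) ** 2 for e in x)
    return numerator.real / denominator

rng = random.Random(0)
ratios = []
for _ in range(1000):
    x = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(2)]
    ratios.append(rayleigh_ratio(A, x))

print(min(ratios), max(ratios))
# The bounds are attained at the eigenvectors of A:
print(rayleigh_ratio(A, [1 + 1j, -1 + 0j]), rayleigh_ratio(A, [1 + 1j, 2 + 0j]))
```

All sampled ratios fall between the two eigenvalues, and the two eigenvectors attain the lower and upper bounds.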


Positive-definite matrices. An $n \times n$ Hermitian matrix $A$ is positive-semidefinite (also called
nonnegative definite) if it satisfies $x^* A x \ge 0$ for all column vectors $x$. It is positive-definite if
$x^* A x > 0$ except when $x = 0$. We denote a positive-definite matrix by writing $A > 0$ and a
positive-semidefinite matrix by writing $A \ge 0$. Among the several characterizations of positive-definite
matrices, we note the following.


Lemma B.2 (Eigenvalues of positive-definite matrices) An $n \times n$ Hermitian
matrix $A$ is positive-definite if, and only if, all its eigenvalues are positive.


Proof: Let $A = U \Lambda U^*$ denote the spectral decomposition of $A$. Let also $u_i$ be the $i$th column of $U$ with $\lambda_i$
the corresponding eigenvalue, i.e., $Au_i = \lambda_i u_i$ with $\|u_i\|^2 = 1$. If we multiply this equality from the left by
$u_i^*$ we get

$$u_i^* A u_i = \lambda_i \|u_i\|^2 = \lambda_i > 0$$

where the last inequality follows from the fact that $x^* A x > 0$ for any nonzero vector $x$. Therefore, $A > 0$
implies $\lambda_i > 0$. Conversely, assume all $\lambda_i > 0$ and multiply the equality $A = U \Lambda U^*$ by any nonzero
vector $x$ and its conjugate transpose, from right and left, to get $x^* A x = x^* U \Lambda U^* x$. Now define the matrix
$\Lambda^{1/2} = \mathrm{diag}\{\sqrt{\lambda_1}, \sqrt{\lambda_2}, \ldots, \sqrt{\lambda_n}\}$ and the vector $y = \Lambda^{1/2} U^* x$. The vector $y$ is nonzero since $U$ and
$\Lambda^{1/2}$ are nonsingular matrices and, therefore, the product $\Lambda^{1/2} U^*$ cannot map a nonzero vector $x$ to 0. Then
the above equality becomes $x^* A x = \|y\|^2 > 0$, which establishes that $A > 0$.


◇
In a similar vein, we can show that


$$A \ge 0 \iff \lambda_i \ge 0$$


Note further that since $\det A = (\det U)(\det \Lambda)(\det U^*)$ and $(\det U)(\det U^*) = 1$, we find that
$\det A = \det \Lambda = \prod_{i=1}^{n} \lambda_i$. Therefore, the determinant of a positive-definite matrix is positive,


$$A > 0 \;\Longrightarrow\; \det A > 0$$

B.2 RANGE SPACES AND NULLSPACES OF MATRICES 


Let A denote an m x n matrix without any constraint on the relative sizes of n and m. 


Range spaces. The column span or the range space of $A$ is defined as the set of all $m \times 1$ vectors
$q$ that can be generated by $Ap$, for all $n \times 1$ vectors $p$. We denote the column span of $A$ by


$$\mathcal{R}(A) \triangleq \{\text{set of all } q \text{ such that } q = Ap \text{ for some } p\}$$


Nullspaces. The nullspace of $A$ is the set of all $n \times 1$ vectors $p$ that are annihilated by $A$, namely,
that satisfy $Ap = 0$. We denote the nullspace of $A$ by


$$\mathcal{N}(A) \triangleq \{\text{set of all } p \text{ such that } Ap = 0\}$$


Properties. A useful property that follows from the definitions of range spaces and nullspaces is
that any vector $z$ from the nullspace of $A^*$ (not $A$) is orthogonal to any vector $q$ in the range space
of $A$, i.e.,

$$z \in \mathcal{N}(A^*), \ q \in \mathcal{R}(A) \;\Longrightarrow\; z^* q = 0$$


Indeed, $z \in \mathcal{N}(A^*)$ implies that $A^* z = 0$ or, equivalently, $z^* A = 0$. Now write $q = Ap$ for some
$p$. Then $z^* q = z^* A p = 0$, as desired. Another useful property is that the matrices $A^* A$ and $A^*$ have
the same range space (i.e., they span the same space). Also, $A$ and $A^* A$ have the same nullspace.


Lemma B.3 (Range and nullspaces) For any m x n matrix A, it holds that 


R(A*) = R(A' A) and N(A) = N(A* A). 


Proof: One direction is immediate. Take a vector q ∈ R(A*A), i.e., q = A*Ap for some p. Define r = Ap,
then q = A*r. This shows that q ∈ R(A*) and we conclude that R(A*A) ⊆ R(A*). The proof of the
converse statement requires more effort.

Take a vector q ∈ R(A*) and let us show that q ∈ R(A*A). Assume, to the contrary, that q does not lie
in R(A*A). This implies that there exists a vector z in the nullspace of A*A that is not orthogonal to q, i.e.,
A*Az = 0 and z*q ≠ 0. Now, if we multiply the equality A*Az = 0 by z* from the left we obtain that
z*A*Az = 0 or, equivalently, ||Az||² = 0, where ||·|| denotes the Euclidean norm of its vector argument.
Therefore, Az is necessarily the zero vector, Az = 0. But from q ∈ R(A*) we have that q = A*p for some p.
Then z*q = z*A*p = 0, which contradicts z*q ≠ 0. Therefore, we must have q ∈ R(A*A) and we conclude
that R(A*) ⊆ R(A*A).

The second assertion in the lemma is more immediate. If Ap = 0 then A*Ap = 0 so that N(A) ⊆
N(A*A). Conversely, if A*Ap = 0 then p*A*Ap = 0 and we must have Ap = 0. That is, N(A*A) ⊆
N(A). Combining both facts we conclude that N(A) = N(A*A). ◇


A useful consequence of Lemma B.3 is that linear systems of equations of the form A*Az = A*b
always have a solution z for any vector b. This is because A*b belongs to R(A*) and, therefore, also
belongs to R(A*A).
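The consistency of the equations A*Az = A*b can be checked on a small example. The sketch below is only an illustration under assumed data: the rank-deficient matrix A, the vector b, and the helper names `transpose` and `matmul` are hypothetical choices, and exact rational arithmetic is used so that the residuals are exactly zero.

```python
from fractions import Fraction

def transpose(X):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*X)]

def matmul(X, Y):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

# Hypothetical rank-deficient A (second column is twice the first) and arbitrary b.
A = [[Fraction(1), Fraction(2)],
     [Fraction(2), Fraction(4)],
     [Fraction(3), Fraction(6)]]
b = [[Fraction(1)], [Fraction(0)], [Fraction(1)]]

At = transpose(A)   # A is real here, so A* is simply the transpose
G = matmul(At, A)   # A*A, a singular 2 x 2 matrix
c = matmul(At, b)   # A*b, which lies in the range space of A*A

# One particular solution: set the second unknown to zero and solve the first equation.
z = [[c[0][0] / G[0][0]], [Fraction(0)]]
residual = [[g[0] - r[0]] for g, r in zip(matmul(G, z), c)]

# Adding a nullspace vector of A (here col{2, -1}) gives another solution.
z_alt = [[z[0][0] + 2], [z[1][0] - 1]]
residual_alt = [[g[0] - r[0]] for g, r in zip(matmul(G, z_alt), c)]

print(residual, residual_alt)
```

Even though A*A is singular in this example, a solution exists and it is not unique, which is consistent with the remark above.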


Rank. The rank of a matrix A is defined as the number of linearly independent columns (or rows)
of A. It holds that

$$\mathrm{rank}(A) \le \min\{m, n\}$$

That is, the rank of a matrix never exceeds its smaller dimension. A matrix is said to have full rank
if

$$\mathrm{rank}(A) = \min\{m, n\}$$

Otherwise, the matrix is said to be rank deficient.

If A is a square matrix (i.e., m = n), then rank deficiency is also equivalent to a zero determi- 
nant, det A = 0. Indeed, if A is rank deficient, then there exists a nonzero p such that Ap = 0. This 
means that 0 is an eigenvalue of A so that its determinant must be zero — recall that the determinant 
of a square matrix is equal to the product of its eigenvalues (see Prob. II.2). A useful result is the 
following, the proof of which is instructive. 


Lemma B.4 (Invertible product) Let A be N x n, with N ≥ n. Then

$$A \text{ has full rank} \iff A^*A \text{ is positive-definite}$$

That is, every tall full rank matrix is such that the square matrix A*A is invertible
(actually, positive-definite).


Proof: Let us first show that A has full rank only if A*A is invertible. Thus, assume A has full rank but
that A*A is not invertible. This means that there exists a nonzero vector p such that A*Ap = 0, which implies
p*A*Ap = 0 or, equivalently, ||Ap||² = 0. That is, Ap = 0. This shows that the nullspace of A is nontrivial so
that A cannot have full rank, which is a contradiction. Therefore a full rank A implies an invertible matrix A*A.

Conversely, assume A*A is invertible and let us show that A has to have full rank. Assume not. Then there
exists a nonzero vector p such that Ap = 0. It follows that A*Ap = 0, which contradicts the invertibility of
A*A. This is because A*Ap = 0 implies that p is an eigenvector of A*A corresponding to the zero eigenvalue.
Hence, the determinant of A*A is necessarily zero.

Finally, let us show that A*A is positive-definite. For this purpose, take any nonzero vector x and consider
the product x*A*Ax, which evaluates to ||Ax||². Then, the product x*A*Ax is necessarily positive; it cannot
be zero since the nullspace of A, in view of A being full rank, contains only the zero vector. ◇


In fact, when A has full rank, not only is A*A positive-definite, but so is any product of the form
A*BA for any Hermitian positive-definite matrix B:

$$A: N \times n,\; N \ge n,\; \text{full rank, and } B > 0 \implies A^*BA > 0 \qquad \text{(B.1)}$$

To see this, recall from Sec. B.1 that every Hermitian matrix B admits an eigen-decomposition of
the form

$$B = U \Lambda U^* \qquad \text{(B.2)}$$

where Λ is diagonal with the eigenvalues of B, and U is a unitary matrix with the orthonormal
eigenvectors of B. Define the matrices

$$\Lambda^{1/2} \triangleq \mathrm{diag}\{\sqrt{\lambda_1}, \sqrt{\lambda_2}, \ldots, \sqrt{\lambda_N}\}, \qquad \bar{A} \triangleq \Lambda^{1/2} U^* A$$

Then $A^*BA = \bar{A}^*\bar{A}$. Now the matrix A has full rank if, and only if, $\bar{A}$ has full rank (because
Λ^{1/2}U* is invertible when B > 0) and, in view of the result of the previous lemma, the full rank property
of $\bar{A}$ is equivalent to the positive-definiteness of $\bar{A}^*\bar{A}$, as desired.
B.3 SCHUR COMPLEMENTS


In this section we assume inverses exist as needed. Thus, consider a block matrix

$$M = \begin{bmatrix} A & B \\ C & D \end{bmatrix}$$

The Schur complement of A in M is denoted by $\Delta_A$ and is defined as

$$\Delta_A \triangleq D - CA^{-1}B$$

Likewise, the Schur complement of D in M is denoted by $\Delta_D$ and is defined as

$$\Delta_D \triangleq A - BD^{-1}C$$


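Block factorizations. In terms of these Schur complements, it is easy to verify by direct calculation
that the matrix M can be factored in either of the following two useful forms:

$$\begin{bmatrix} A & B \\ C & D \end{bmatrix} = \begin{bmatrix} I & 0 \\ CA^{-1} & I \end{bmatrix} \begin{bmatrix} A & 0 \\ 0 & \Delta_A \end{bmatrix} \begin{bmatrix} I & A^{-1}B \\ 0 & I \end{bmatrix} = \begin{bmatrix} I & BD^{-1} \\ 0 & I \end{bmatrix} \begin{bmatrix} \Delta_D & 0 \\ 0 & D \end{bmatrix} \begin{bmatrix} I & 0 \\ D^{-1}C & I \end{bmatrix}$$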


Inertia. When M is Hermitian (i.e., M = M*), its eigenvalues are real (cf. Sec. B.1). We then
define its inertia as the triplet In{M} = {I_+, I_-, I_0}, where

I_+(M) = the number of positive eigenvalues of M

I_-(M) = the number of negative eigenvalues of M

I_0(M) = the number of zero eigenvalues of M


Congruence. Given a Hermitian matrix M, the matrices M and QMQ* are said to be congruent
for any invertible matrix Q. An important result regarding congruent matrices is the following, which
states that congruence preserves inertia.


Lemma B.5 (Sylvester's law of inertia) Congruent matrices have identical iner- 
tia, i.e., for any Hermitian M and invertible Q, it holds that In{M} = In{QMQ*}. 


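Thus, assume that M has the block Hermitian form

$$M = \begin{bmatrix} A & B \\ B^* & D \end{bmatrix} \quad \text{with} \quad A = A^*,\; D = D^*$$

and consider the corresponding block factorizations

$$M = \begin{bmatrix} I & 0 \\ B^*A^{-1} & I \end{bmatrix} \begin{bmatrix} A & 0 \\ 0 & \Delta_A \end{bmatrix} \begin{bmatrix} I & A^{-1}B \\ 0 & I \end{bmatrix} = \begin{bmatrix} I & BD^{-1} \\ 0 & I \end{bmatrix} \begin{bmatrix} \Delta_D & 0 \\ 0 & D \end{bmatrix} \begin{bmatrix} I & 0 \\ D^{-1}B^* & I \end{bmatrix}$$

where now

$$\Delta_A = D - B^*A^{-1}B, \qquad \Delta_D = A - BD^{-1}B^*$$

The above factorizations have the form of congruence relations so that we must have

$$\mathrm{In}\{M\} = \mathrm{In}\left\{ \begin{bmatrix} A & 0 \\ 0 & \Delta_A \end{bmatrix} \right\} \quad \text{and} \quad \mathrm{In}\{M\} = \mathrm{In}\left\{ \begin{bmatrix} \Delta_D & 0 \\ 0 & D \end{bmatrix} \right\}$$

Positive-definite matrices. When M is positive-definite, all its eigenvalues will be positive
(cf. Sec. B.1). Then, from the above inertia equalities, it will follow that $\{A, \Delta_A, \Delta_D, D\}$ can
only have positive eigenvalues as well. In other words, it is easy to conclude that

$$M > 0 \iff A > 0 \text{ and } \Delta_A > 0$$

Likewise,

$$M > 0 \iff D > 0 \text{ and } \Delta_D > 0$$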

B.4 CHOLESKY FACTORIZATION 


Let us continue to assume that M is positive-definite and let us decompose it now as

$$M = \begin{bmatrix} a & b^* \\ b & D \end{bmatrix}$$

where a is its (0, 0) entry, so that b is a column vector and a is a scalar. The positive-definiteness of
M guarantees a > 0 and D > 0. Let us further consider the first block factorization shown above
for M and write

$$M = \begin{bmatrix} 1 & 0 \\ b/a & I \end{bmatrix} \begin{bmatrix} a & 0 \\ 0 & \Delta_a \end{bmatrix} \begin{bmatrix} 1 & b^*/a \\ 0 & I \end{bmatrix}$$


We can rewrite this factorization more compactly in the form

$$M = \mathcal{L}_0 \begin{bmatrix} a(0) & \\ & \Delta_0 \end{bmatrix} \mathcal{L}_0^* \qquad \text{(B.3)}$$

where $\mathcal{L}_0$ is the lower-triangular matrix

$$\mathcal{L}_0 \triangleq \begin{bmatrix} 1 & 0 \\ b/a & I \end{bmatrix}$$

and a(0) = a, $\Delta_0 = \Delta_a = D - bb^*/a$. Observe that the first column of $\mathcal{L}_0$ is the first column
of M normalized by the inverse of the (0, 0) entry of M. Moreover, the positive-definiteness of M
guarantees a(0) > 0 and $\Delta_0 > 0$. Note further that if M is n x n then $\Delta_0$ is (n − 1) x (n − 1).
Expression (B.3) provides a factorization for M that consists of a lower-triangular matrix followed
by a block-diagonal matrix and by an upper-triangular matrix. Now since $\Delta_0$ is in turn
positive-definite, we can introduce a similar factorization for it, which we denote by

$$\Delta_0 = L_1 \begin{bmatrix} a(1) & \\ & \Delta_1 \end{bmatrix} L_1^*$$

for some lower-triangular matrix $L_1$ and where a(1) is the (0, 0) entry of $\Delta_0$. Moreover, $\Delta_1$ is the
Schur complement of a(1) in $\Delta_0$, and its dimensions are (n − 2) x (n − 2). In addition, the first
column of $L_1$ coincides with the first column of $\Delta_0$ normalized by the inverse of the (0, 0) entry of
$\Delta_0$. Also, the positive-definiteness of $\Delta_0$ guarantees a(1) > 0 and $\Delta_1 > 0$. Substituting the above
factorization of $\Delta_0$ into the factorization for M we get

$$M = \mathcal{L}_0 \begin{bmatrix} 1 & \\ & L_1 \end{bmatrix} \begin{bmatrix} a(0) & & \\ & a(1) & \\ & & \Delta_1 \end{bmatrix} \begin{bmatrix} 1 & \\ & L_1^* \end{bmatrix} \mathcal{L}_0^*$$

But since the product of two lower-triangular matrices is also lower-triangular, we conclude that

$$\mathcal{L}_0 \begin{bmatrix} 1 & \\ & L_1 \end{bmatrix}$$

is lower-triangular, so that we can denote it by $\mathcal{L}_1$ and write instead

$$M = \mathcal{L}_1 \begin{bmatrix} a(0) & & \\ & a(1) & \\ & & \Delta_1 \end{bmatrix} \mathcal{L}_1^*$$

Clearly, the first column of $\mathcal{L}_1$ is the first column of $\mathcal{L}_0$ and the second column of $\mathcal{L}_1$ is formed from
the first column of $L_1$.


We can proceed to factor $\Delta_1$, which would lead to an expression of the form

$$M = \mathcal{L}_2 \begin{bmatrix} a(0) & & & \\ & a(1) & & \\ & & a(2) & \\ & & & \Delta_2 \end{bmatrix} \mathcal{L}_2^*$$

where a(2) > 0 is the (0, 0) entry of $\Delta_1$ and $\Delta_2 > 0$ is the Schur complement of a(2) in $\Delta_1$.
Continuing in this fashion we arrive (after (n − 1) Schur complementation steps) at a factorization
for M of the form $M = \mathcal{L}_{n-1} \mathcal{D} \mathcal{L}_{n-1}^*$, where $\mathcal{L}_{n-1}$ is n x n lower-triangular and $\mathcal{D}$ is n x n
diagonal with positive entries {a(i)}. The columns of $\mathcal{L}_{n-1}$ are the successive leading columns of
the Schur complements $\{\Delta_i\}$, normalized by the inverses of the (0, 0) entries of $\{\Delta_i\}$. The diagonal
entries of $\mathcal{D}$ coincide with these (0, 0) entries of the $\{\Delta_i\}$.
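The successive Schur complementation steps described above can be sketched in a few lines of code. This is only an illustrative sketch, not a recommended numerical method: the function name `ldl_by_schur` and the 3 x 3 test matrix are hypothetical, and the matrix is assumed to be Hermitian positive-definite and stored as a list of rows.

```python
import math

def ldl_by_schur(M):
    """Return (L, d) such that M = L diag(d) L^*, with L unit lower-triangular.

    Each pass reads off the (0,0) entry a of the current Schur complement,
    stores its first column normalized by a as the next column of L, and
    replaces the Schur complement D by D - b b^*/a.
    """
    n = len(M)
    delta = [[complex(v) for v in row] for row in M]  # current Schur complement
    L = [[0j] * n for _ in range(n)]
    d = []
    for k in range(n):
        a = delta[0][0].real  # real and positive when M is positive-definite
        if a <= 0:
            raise ValueError("matrix is not positive-definite")
        d.append(a)
        for i in range(len(delta)):
            L[k + i][k] = delta[i][0] / a  # normalized leading column
        size = len(delta)
        delta = [[delta[i][j] - delta[i][0] * delta[0][j] / a
                  for j in range(1, size)]
                 for i in range(1, size)]
    return L, d

# Hypothetical positive-definite example.
M = [[4, 2, 2],
     [2, 5, 3],
     [2, 3, 6]]
L, d = ldl_by_schur(M)

# Scaling the columns of L by the square-roots of d gives a triangular factor
# with positive diagonal entries whose product with its conjugate transpose is M.
L_bar = [[L[i][j] * math.sqrt(d[j]) for j in range(3)] for i in range(3)]
M_rebuilt = [[sum(L_bar[i][k] * L_bar[j][k].conjugate() for k in range(3))
              for j in range(3)] for i in range(3)]
print(d, M_rebuilt)
```

The raised `ValueError` reflects the fact that the recursion requires every (0, 0) entry encountered along the way to be positive.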


If we define $\bar{L} \triangleq \mathcal{L}_{n-1} \mathcal{D}^{1/2}$, where $\mathcal{D}^{1/2}$ is a diagonal matrix with the positive square-roots of
the {a(i)}, we obtain

$$M = \bar{L}\bar{L}^* \qquad \text{(lower-upper triangular factorization)}$$

In summary, this argument shows that every positive-definite matrix can be factored as the product of
a lower-triangular matrix with positive diagonal entries by its conjugate transpose. This factorization
is called the Cholesky factorization of M.

Had we instead partitioned M as

$$M = \begin{bmatrix} A & b \\ b^* & d \end{bmatrix}$$

where d is now a scalar, and had we used the block factorization

$$M = \begin{bmatrix} I & b/d \\ 0 & 1 \end{bmatrix} \begin{bmatrix} \Delta_d & 0 \\ 0 & d \end{bmatrix} \begin{bmatrix} I & 0 \\ b^*/d & 1 \end{bmatrix}$$

we would have arrived at a similar factorization for M of the form

$$M = \bar{U}\bar{U}^* \qquad \text{(upper-lower triangular factorization)}$$

where $\bar{U}$ is an upper-triangular matrix with positive diagonal entries.


Lemma B.6 (Cholesky factorization) Every positive-definite matrix M admits a
unique factorization of either form $M = \bar{L}\bar{L}^* = \bar{U}\bar{U}^*$, where $\bar{L}$ ($\bar{U}$) is a lower-triangular
(upper-triangular) matrix with positive entries along its diagonal.


Proof: The existence of the factorizations was proved prior to the statement of the lemma. It remains to establish
uniqueness. We show this for one of the factorizations. A similar argument applies to the other factorization.
Thus, assume that

$$M = \bar{L}_1 \bar{L}_1^* = \bar{L}_2 \bar{L}_2^* \qquad \text{(B.4)}$$

are two Cholesky factorizations for M. Then

$$\bar{L}_2^{-1} \bar{L}_1 = \bar{L}_2^* \bar{L}_1^{-*} \qquad \text{(B.5)}$$

where the compact notation $A^{-*}$ stands for $A^{-*} = [A^{-1}]^* = [A^*]^{-1}$. But since the inverse of a lower-triangular
matrix is lower-triangular, and since the product of two lower-triangular matrices is also lower-triangular,
we conclude that $\bar{L}_2^{-1}\bar{L}_1$ is lower-triangular. Likewise, we can verify that the product $\bar{L}_2^*\bar{L}_1^{-*}$ is upper-triangular.
Therefore, the equality (B.5) will hold if, and only if, $\bar{L}_2^{-1}\bar{L}_1$ is a diagonal matrix, which means that

$$\bar{L}_1 = \bar{L}_2 D \qquad \text{(B.6)}$$

for some diagonal matrix D. We want to show that D is the identity matrix. Indeed, it is easy to see from (B.4)
that the (0, 0) entries of $\bar{L}_1$ and $\bar{L}_2$ must coincide so that the leading entry of D must be unity. This further
implies from (B.6) that the first column of $\bar{L}_1$ should coincide with the first column of $\bar{L}_2$, so that using (B.4)
again we conclude that the (1, 1) entries of $\bar{L}_1$ and $\bar{L}_2$ also coincide. Hence, the second entry of D is also unity.
Proceeding in this fashion we get D = I. ◇


Remark B.1 (Evaluation of Cholesky factors) While we obtained the Cholesky factorization of a positive-definite
matrix by performing a sequence of successive Schur complements, this need not be the preferred method
numerically for the evaluation of the Cholesky factor. Later in Ch. 34 we shall see that the same Cholesky factor
can be obtained by applying to the matrix a sequence of unitary rotations, which are better conditioned numerically
(so that this alternative method of computation is less sensitive to roundoff errors).

◇


Remark B.2 (Alternative factorization) We also conclude from the discussion in this section that every
positive-definite matrix M admits a unique factorization of either form $M = LDL^* = UD_uU^*$, where L (U)
is a lower-triangular (upper-triangular) matrix with unit diagonal entries, and D and $D_u$ are diagonal matrices
with positive entries. [Actually, $L = \mathcal{L}_{n-1}$ and $D = \mathcal{D}$.]

◇


B.5 QR DECOMPOSITION 


The QR decomposition of a matrix is a very useful tool and we shall comment on its value and 
convenience on a handful of occasions. It can be motivated as follows. 

Consider a collection of n column vectors, $\{h_i, i = 0, 1, \ldots, n-1\}$, of dimensions N x 1
each with N ≥ n. These vectors can be converted into an equivalent set of orthonormal vectors
$\{q_i, i = 0, 1, \ldots, n-1\}$, which span the same linear subspace as the $\{h_i\}$, by appealing to the
classical Gram-Schmidt procedure. This is an iterative procedure that operates as follows. It starts
with $q_0 = h_0/\|h_0\|$, where $\|h_0\|$ denotes the Euclidean norm of $h_0$, and then repeats for i > 0:

$$r_i = h_i - \sum_{j=0}^{i-1} (q_j^* h_i)\, q_j, \qquad q_i = r_i/\|r_i\|$$

Thus, observe, for example, that $r_1 = h_1 - (q_0^* h_1) q_0$, which is simply the residual vector that results
from projecting $h_1$ onto $q_0$. Clearly, by the orthogonality property of least-squares solutions, this
residual vector is orthogonal to $q_0$. By normalizing its norm to unity and by defining $q_1 = r_1/\|r_1\|$,
we end up with a unit-norm vector that is orthogonal to $q_0$. In this way, we would have replaced
the original vectors $\{h_0, h_1\}$ by two orthonormal vectors $\{q_0, q_1\}$. More generally, for any i, the
vector $r_i$ is the residual that results from projecting $h_i$ onto the successive orthonormal vectors
$\{q_0, q_1, \ldots, q_{i-1}\}$.

Now we can express each column $h_i$ as a linear combination of $\{q_0, q_1, \ldots, q_i\}$ as follows:

$$h_i = (q_0^* h_i) q_0 + (q_1^* h_i) q_1 + \ldots + (q_{i-1}^* h_i) q_{i-1} + \|r_i\|\, q_i \qquad \text{(B.7)}$$

If we collect the $\{h_i\}$ into a matrix $H = [h_0 \; h_1 \; \ldots \; h_{n-1}]$, and if we collect the coefficients of the
linear combinations (B.7), for all i = 0, 1, ..., n − 1, into an n x n upper-triangular matrix R (with $r_0 = h_0$),

$$R = \begin{bmatrix} \|r_0\| & q_0^* h_1 & q_0^* h_2 & \cdots & q_0^* h_{n-1} \\ & \|r_1\| & q_1^* h_2 & \cdots & q_1^* h_{n-1} \\ & & \ddots & & \vdots \\ & & & & \|r_{n-1}\| \end{bmatrix}$$

we find that we can express the decomposition (B.7) in matrix form as follows:

$$H = QR \qquad \text{(B.8)}$$

where Q is N x n with orthonormal columns $\{q_i\}$, i.e., $Q = [q_0 \; q_1 \; \ldots \; q_{n-1}]$ with $q_j^* q_i = \delta_{ij}$.
When H has full column-rank, i.e., when H has rank n, all the diagonal entries of R will be positive.
The factorization H = QR is referred to as the reduced QR decomposition of the matrix H;
in effect, the QR decomposition of a matrix simply amounts to the orthonormalization of its columns.
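The Gram-Schmidt recursion above translates almost directly into code. The following is a minimal sketch of the classical procedure (not a numerically robust implementation); the function names `inner` and `gram_schmidt_qr` and the example columns are hypothetical, and the columns are assumed to be linearly independent.

```python
import math

def inner(a, b):
    """Inner product a^* b for vectors stored as lists."""
    return sum(complex(x).conjugate() * y for x, y in zip(a, b))

def gram_schmidt_qr(H_cols):
    """Classical Gram-Schmidt on a list of columns.

    Returns (Q_cols, R): the orthonormal columns q_i and the n x n
    upper-triangular R with R[j][i] = q_j^* h_i and R[i][i] = ||r_i||.
    """
    n = len(H_cols)
    Q_cols = []
    R = [[0j] * n for _ in range(n)]
    for i, h in enumerate(H_cols):
        r = [complex(v) for v in h]
        for j, q in enumerate(Q_cols):
            R[j][i] = inner(q, h)  # coefficient of q_j in h_i
            r = [rv - R[j][i] * qv for rv, qv in zip(r, q)]
        norm = math.sqrt(sum(abs(v) ** 2 for v in r))
        if norm == 0:
            raise ValueError("columns are linearly dependent")
        R[i][i] = norm
        Q_cols.append([v / norm for v in r])
    return Q_cols, R

# Hypothetical example with N = 3 and n = 2.
H_cols = [[3, 4, 0], [1, 0, 0]]
Q_cols, R = gram_schmidt_qr(H_cols)

# Rebuild each column h_i as the combination sum_j R[j][i] q_j.
H_rebuilt = [[sum(R[j][i] * Q_cols[j][k] for j in range(2)) for k in range(3)]
             for i in range(2)]
print(R, H_rebuilt)
```

In this example the diagonal entries of R come out positive, as expected for a matrix with full column rank.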


Definition B.1 (Reduced QR decomposition) Given an N x n, n ≤ N, matrix
$H = [h_0 \; h_1 \; \ldots \; h_{n-1}]$, it can be decomposed as H = QR, where R is
n x n and upper triangular and Q is N x n with orthonormal columns.


It is often convenient to employ the full QR decomposition of H, as opposed to its reduced QR
decomposition. The full decomposition is obtained by appending N − n orthonormal columns to Q
so that it becomes a unitary N x N (square) matrix $\bar{Q}$. Correspondingly, we also append rows of
zeros to R so that (B.8) becomes

$$H = \bar{Q} \begin{bmatrix} R \\ 0 \end{bmatrix} \qquad \text{(B.9)}$$

where $\bar{Q} = [Q \; q_n \; \ldots \; q_{N-1}]$.


Definition B.2 (Full QR decomposition) Given an N x n, n ≤ N, matrix
$H = [h_0 \; h_1 \; \ldots \; h_{n-1}]$, it can be decomposed as in (B.9), where R is n x n and
upper triangular and $\bar{Q}$ is N x N and unitary.


B.6 SINGULAR VALUE DECOMPOSITION 


The singular value decomposition of a matrix (SVD, for short) is another powerful tool that is useful
for both analytical and numerical purposes. It enables us to represent any matrix (square or not,
invertible or not, Hermitian or not) as the product of three matrices with special and desirable properties:
two of the matrices are unitary (orthogonal in the real case) and the third matrix is composed
of a diagonal matrix and a zero block.

Specifically, the SVD of a matrix A states that if A is n x m then there exist an n x n unitary
matrix U (UU* = I), an m x m unitary matrix V (VV* = I), and a diagonal matrix Σ with
nonnegative entries such that:

(i) If n ≤ m, then Σ is n x n and

$$A = U \begin{bmatrix} \Sigma & 0 \end{bmatrix} V^*, \qquad A: n \times m,\; n \le m$$

(ii) If n > m, then Σ is m x m and

$$A = U \begin{bmatrix} \Sigma \\ 0 \end{bmatrix} V^*, \qquad A: n \times m,\; n > m$$

Observe that U and V are square matrices, while the central matrix has the dimensions of A. The
diagonal entries of Σ are called the singular values of A and are usually ordered in decreasing order,
say, $\Sigma = \mathrm{diag}\{\sigma_1, \sigma_2, \ldots, \sigma_p, 0, \ldots, 0\}$ with $\sigma_1 \ge \sigma_2 \ge \ldots \ge \sigma_p > 0$. If Σ has p nonzero
diagonal entries then A has rank p. The columns of U and V are called the left- and right-singular
vectors of A, respectively.


Constructive Proof of the SVD 


One proof of the SVD follows from the eigen-decomposition of a Hermitian nonnegative-definite
matrix. The argument given here assumes n ≤ m, but it can be adjusted to handle the case
n > m.

Note that AA* is a Hermitian nonnegative-definite matrix and, consequently, there exists an n x n
unitary matrix U and an n x n diagonal matrix Σ², with nonnegative entries, such that AA* =
UΣ²U*. This representation simply corresponds to the eigen-decomposition of AA*. The diagonal
entries of Σ² are the eigenvalues of AA*, which are nonnegative (and, hence, the notation $\sigma_i^2$);
the nonzero ones are also equal to the nonzero eigenvalues of A*A. The columns of U are the corresponding
orthonormal eigenvectors. By proper reordering, we can always arrange the diagonal entries of Σ²
in decreasing order so that Σ² can be put into the form $\Sigma^2 = \mathrm{diag}\{\sigma_1^2, \sigma_2^2, \ldots, \sigma_p^2, 0, \ldots, 0\}$,
where p = rank(AA*) and $\sigma_1^2 \ge \sigma_2^2 \ge \ldots \ge \sigma_p^2 > 0$. If we define the m x n matrix

$$V_1 \triangleq A^* U \,\mathrm{diag}\{\sigma_1^{-1}, \ldots, \sigma_p^{-1}, 0, \ldots, 0\}$$

then it is immediate to conclude that $A = U \Sigma V_1^*$, and that $V_1$ satisfies

$$V_1^* V_1 = \begin{bmatrix} I_p & 0 \\ 0 & 0 \end{bmatrix}$$

But the columns of $V_1$ can be completed to a full unitary basis V of an m-dimensional space, say,
$V = [V_1 \; V_2]$, such that VV* = V*V = I (any zero columns of $V_1$, which multiply zero singular values,
are replaced by orthonormal vectors in the completion). This allows us to conclude that we can write

$$A = U \begin{bmatrix} \Sigma & 0 \end{bmatrix} V^*$$

which is the desired SVD of A in the n ≤ m case. A similar argument
establishes the SVD of A in the n > m case.


Spectral Norm of a Matrix 


Assume n > m and consider the m x m square matrix A*A. Using the Rayleigh-Ritz characterization
of the eigenvalues of a matrix from Sec. B.1, we have that

$$\lambda_{\max}(A^*A) = \max_{x \ne 0} \left( \frac{x^* A^* A x}{x^* x} \right) = \max_{x \ne 0} \left( \frac{\|Ax\|^2}{\|x\|^2} \right)$$

But since $\sigma_1^2 = \lambda_{\max}(A^*A)$, we conclude that the largest singular value of A satisfies:

$$\sigma_1 = \max_{x \ne 0} \left( \|Ax\| / \|x\| \right)$$

In other words, the square of the maximum singular value, $\sigma_1^2$, measures the maximum energy gain
from x to Ax. The same conclusion holds when n ≤ m since $\sigma_1^2 = \lambda_{\max}(A^*A)$ as well, and
the argument can be repeated. In addition, if we select $x = v_1$ (i.e., as the right singular vector
corresponding to $\sigma_1$), then this choice for x achieves the maximum gain since $Av_1 = \sigma_1 u_1$ and,
therefore,

$$\frac{\|Av_1\|^2}{\|v_1\|^2} = \frac{\sigma_1^2 \|u_1\|^2}{\|v_1\|^2} = \sigma_1^2$$

The maximum singular value of a matrix is called its spectral norm or its 2-induced norm, written as

$$\|A\|_2 \triangleq \max_{x \ne 0} \frac{\|Ax\|}{\|x\|}$$

This definition can be taken as a matrix norm because it can be verified to satisfy the properties of a
norm. Specifically, a matrix norm ||·|| should satisfy the following properties, for any matrices A
and B and for any complex scalar α:

1. ||A|| ≥ 0 always and ||A|| = 0 if, and only if, A = 0.

2. ||αA|| = |α| · ||A||.

3. Triangle inequality: ||A + B|| ≤ ||A|| + ||B||.

4. Submultiplicative property: ||AB|| ≤ ||A|| · ||B||.


Pseudo-inverses 
The pseudo-inverse of a matrix is a generalization of the concept of inverses for square invertible
matrices; it is defined for matrices that need not be square or even invertible.

Given an n x m matrix A, its pseudo-inverse is defined as the m x n matrix $A^\dagger$ that satisfies the
following four requirements:

(i) $AA^\dagger A = A$

(ii) $A^\dagger A A^\dagger = A^\dagger$

(iii) $(AA^\dagger)^* = AA^\dagger$

(iv) $(A^\dagger A)^* = A^\dagger A$

The SVD of A can be used to determine its pseudo-inverse as follows. Introduce the matrix
$\Sigma^\dagger = \mathrm{diag}\{\sigma_1^{-1}, \sigma_2^{-1}, \ldots, \sigma_p^{-1}, 0, \ldots, 0\}$. That is, we invert the nonzero entries of Σ and keep the
zero entries unchanged.

(i) When n ≤ m, we define

$$A^\dagger = V \begin{bmatrix} \Sigma^\dagger \\ 0 \end{bmatrix} U^* \qquad (A: n \times m,\; A^\dagger: m \times n,\; n \le m)$$

(ii) When n > m, we define

$$A^\dagger = V \begin{bmatrix} \Sigma^\dagger & 0 \end{bmatrix} U^* \qquad (A: n \times m,\; A^\dagger: m \times n,\; n > m)$$

It can be verified that this $A^\dagger$ satisfies the four defining properties (i)-(iv) listed above. In addition,
it can be verified, by replacing A by its SVD in the expressions below, that when A has full rank, its
pseudo-inverse is given by

1. when n ≤ m, $A^\dagger = A^*(AA^*)^{-1}$.

2. when n > m, $A^\dagger = (A^*A)^{-1}A^*$.
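As a small sanity check of the full-rank formula for the tall case, the sketch below forms $(A^*A)^{-1}A^*$ for a hypothetical real 3 x 2 matrix and tests the four defining requirements with exact rational arithmetic. The helper names (`transpose`, `matmul`, `inv2`) and the matrix entries are illustrative assumptions; for a real matrix the conjugate transpose reduces to the transpose.

```python
from fractions import Fraction

def transpose(X):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*X)]

def matmul(X, Y):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def inv2(G):
    """Inverse of an invertible 2 x 2 matrix via the adjugate formula."""
    (a, b), (c, d) = G
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

# Hypothetical full-rank tall matrix (n = 3 rows, m = 2 columns).
A = [[Fraction(1), Fraction(0)],
     [Fraction(0), Fraction(1)],
     [Fraction(1), Fraction(1)]]

At = transpose(A)
A_pinv = matmul(inv2(matmul(At, A)), At)  # (A*A)^{-1} A*

P = matmul(A, A_pinv)  # 3 x 3, should be symmetric
S = matmul(A_pinv, A)  # 2 x 2, equals the identity in the full-rank tall case

checks = {
    "A A+ A = A": matmul(P, A) == A,
    "A+ A A+ = A+": matmul(S, A_pinv) == A_pinv,
    "(A A+)* = A A+": transpose(P) == P,
    "(A+ A)* = A+ A": transpose(S) == S,
}
print(A_pinv, checks)
```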


Minimum Norm Solution of Least-Squares Problems 
A useful application of the pseudo-inverse of a matrix is in the characterization of the minimum norm 
solution of a least-squares problem with infinitely many solutions. 


Lemma B.7 (Minimum norm solution) Consider a least-squares problem of the
form

$$\min_{w} \|y - Hw\|^2$$

with possibly an infinite number of solutions $\{\hat{w}\}$. The least-norm solution among
these (i.e., the solution $\hat{w}$ with the smallest Euclidean norm) is unique and is given
by $\hat{w} = H^\dagger y$. In other words, this particular $\hat{w}$ is the unique solution to the following
optimization problem:

$$\min_{w \in \mathcal{W}} \|w\| \quad \text{where} \quad \mathcal{W} = \{w \text{ such that } \|y - Hw\|^2 \text{ is minimum}\}$$


Proof: Let N x M denote the dimensions of H. We establish the result for the case N ≥ M (i.e., for the
over-determined case) by using the convenience of the SVD representation. A similar argument applies to the
under-determined case. Thus, let p ≤ M denote the rank of H and introduce its SVD, say,

$$H = U \begin{bmatrix} \Sigma \\ 0 \end{bmatrix} V^*$$

with $\Sigma = \mathrm{diag}\{\sigma_1, \ldots, \sigma_p, 0, \ldots, 0\}$. Then

$$\|y - Hw\|^2 = \|U^*y - U^*HVV^*w\|^2 = \left\| f - \begin{bmatrix} \Sigma \\ 0 \end{bmatrix} z \right\|^2$$

where we introduced the vectors z = V*w and f = U*y. Therefore, the problem of minimizing
||y − Hw||² over w is equivalent to the problem of minimizing the rightmost term over z. Let {z(i), f(i)} denote the
individual entries of {z, f}. Then

$$\left\| f - \begin{bmatrix} \Sigma \\ 0 \end{bmatrix} z \right\|^2 = \sum_{i=1}^{p} |f(i) - \sigma_i z(i)|^2 + \sum_{i=p+1}^{N} |f(i)|^2$$

The second term is independent of z. Hence, any solution z has to satisfy $z(i) = f(i)/\sigma_i$ for i = 1 to p and
z(i) arbitrary for i = p + 1 to i = M. The one with the smallest Euclidean norm requires that these latter values
be chosen as zero. In this case, the minimum norm solution becomes

$$\hat{w} = V \,\mathrm{col}\{f(1)/\sigma_1, \ldots, f(p)/\sigma_p, 0, \ldots, 0\} = V \begin{bmatrix} \Sigma^\dagger & 0 \end{bmatrix} U^* y = H^\dagger y$$

as desired. ◇


B.7 KRONECKER PRODUCTS 


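Let $A = [a_{ij}]_{i,j=1}^{m}$ and $B = [b_{ij}]_{i,j=1}^{n}$ be m x m and n x n matrices, respectively. Their Kronecker
product (also called their tensor product) is denoted by A ⊗ B and is defined as the mn x mn matrix
whose entries are given by (see, e.g., Horn and Johnson (1994)):

$$A \otimes B = \begin{bmatrix} a_{11}B & a_{12}B & \cdots & a_{1m}B \\ a_{21}B & a_{22}B & \cdots & a_{2m}B \\ \vdots & \vdots & & \vdots \\ a_{m1}B & a_{m2}B & \cdots & a_{mm}B \end{bmatrix}$$

In other words, each entry of A is replaced by a scaled multiple of B. In particular, if A is the identity
matrix, $I_m$, then $I_m \otimes B$ is a block diagonal matrix with B repeated along its diagonal:

$$I_m \otimes B = \mathrm{diag}\{\underbrace{B, B, \ldots, B}_{m \text{ times}}\}$$

On the other hand, $A \otimes I_n$ is not a block diagonal matrix. For example, if m = 2 = n and $B = I_2$
then

$$A \otimes I_2 = \begin{bmatrix} a_{11} & 0 & a_{12} & 0 \\ 0 & a_{11} & 0 & a_{12} \\ a_{21} & 0 & a_{22} & 0 \\ 0 & a_{21} & 0 & a_{22} \end{bmatrix}$$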


One of the main uses of Kronecker products, at least for our purposes, is that they allow us to 
replace matrix operations by vector operations, as we saw repeatedly in the body of the chapter. The 
following is a list of some of the useful properties of Kronecker products, with property (vi) being 
the one that we used the most in our development. 


Lemma B.8 (Useful properties) Consider m x m and n x n matrices A and B and
let $\{\alpha_i, i = 1, \ldots, m\}$ and $\{\beta_j, j = 1, \ldots, n\}$ denote their eigenvalues, respectively.
The matrices may be real or complex-valued. Then it holds that:

(i) $(A \otimes B)(C \otimes D) = AC \otimes BD$.

(ii) If A and B are invertible matrices, then $(A \otimes B)^{-1} = A^{-1} \otimes B^{-1}$; observe
that the order of the matrices A and B is not switched.

(iii) $(A \otimes B)$ has mn eigenvalues and they are equal to all combinations $\{\alpha_i \beta_j\}$,
for i = 1, ..., m and j = 1, ..., n.

(iv) $\mathrm{Tr}(A \otimes B) = \mathrm{Tr}(A)\,\mathrm{Tr}(B)$.

(v) $\det(A \otimes B) = (\det A)^n (\det B)^m$.

(vi) For any matrices {A, B, C, X} of compatible dimensions, if C = AXB, then
$\mathrm{vec}(C) = (B^T \otimes A)\,\mathrm{vec}(X)$.

(vii) $(A \otimes B)^T = A^T \otimes B^T$ as well as $(A \otimes B)^* = A^* \otimes B^*$.


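Proof: Part (i) follows by direct calculation from the definition of Kronecker products. Part (ii) follows by
using part (i) to note that $(A \otimes B)(A^{-1} \otimes B^{-1}) = I \otimes I = I$. Part (iii) follows from part (i) by choosing C as
a right eigenvector for A and D as a right eigenvector for B, say, $C = q_i$ and $D = p_j$ where $Aq_i = \alpha_i q_i$ and
$Bp_j = \beta_j p_j$. Then

$$(A \otimes B)(q_i \otimes p_j) = \alpha_i \beta_j (q_i \otimes p_j)$$

which shows that $(q_i \otimes p_j)$ is an eigenvector of $(A \otimes B)$ with eigenvalue $\alpha_i \beta_j$. Part (iv) follows from part (iii)
since

$$\mathrm{Tr}(A) = \sum_{i=1}^{m} \alpha_i, \qquad \mathrm{Tr}(B) = \sum_{j=1}^{n} \beta_j$$

and, therefore,

$$\mathrm{Tr}(A)\,\mathrm{Tr}(B) = \left( \sum_{i=1}^{m} \alpha_i \right) \left( \sum_{j=1}^{n} \beta_j \right) = \sum_{i=1}^{m} \sum_{j=1}^{n} \alpha_i \beta_j = \mathrm{Tr}(A \otimes B)$$

Part (v) also follows from part (iii) since

$$\det(A \otimes B) = \prod_{i=1}^{m} \prod_{j=1}^{n} \alpha_i \beta_j = \left( \prod_{i=1}^{m} \alpha_i \right)^{n} \left( \prod_{j=1}^{n} \beta_j \right)^{m} = (\det A)^n (\det B)^m$$

Part (vi) follows from the definition of Kronecker products and from noting that the vec representation of the
rank one matrix $ab^T$ is simply $b \otimes a$, i.e., $\mathrm{vec}(ab^T) = b \otimes a$. Finally, part (vii) follows from the definition of
Kronecker products. ◇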
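Properties (iv) and (vi) are easy to check numerically on small integer matrices. The sketch below is only an illustration: the helper names (`kron`, `vec`, `matmul`, `transpose`, `trace`) and the matrix entries are hypothetical, and vec is assumed to stack the columns of its argument on top of each other.

```python
def transpose(X):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*X)]

def matmul(X, Y):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def kron(A, B):
    """Kronecker product: every entry a_ij of A is replaced by the block a_ij * B."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]

def vec(X):
    """Stack the columns of X into one long vector (returned as a flat list)."""
    return [X[i][j] for j in range(len(X[0])) for i in range(len(X))]

def trace(X):
    return sum(X[i][i] for i in range(len(X)))

# Hypothetical matrices: A is 2 x 2, B is 3 x 3, X is 2 x 3.
A = [[1, 2], [3, 4]]
B = [[0, 1, 2], [1, 0, 3], [4, 1, 1]]
X = [[1, 0, 2], [-1, 3, 1]]

C = matmul(matmul(A, X), B)  # C = A X B
K = kron(transpose(B), A)    # B^T (x) A, a 6 x 6 matrix
lhs = vec(C)
rhs = [sum(k * v for k, v in zip(row, vec(X))) for row in K]

print(lhs == rhs, trace(kron(A, B)) == trace(A) * trace(B))
```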


CHAPTER C 


Complex Gradients 


In this chapter we explain how to differentiate a scalar-valued function g(z) with respect to a
complex-valued argument, z, and its complex conjugate, z*. The argument z could be either a
scalar or a vector.


C.1 CAUCHY-RIEMANN CONDITIONS 


We start with a scalar argument z = x + jy, where $j = \sqrt{-1}$. In this case, we can regard g(z) as a
function of the two real scalar variables, x and y, say,

$$g(z) = u(x, y) + jv(x, y) \qquad \text{(C.1)}$$

with u(·,·) denoting its real part and v(·,·) denoting its imaginary part. Now, from complex function
theory, the derivative of g(z) at a point $z_0 = x_0 + jy_0$ is defined as (see, e.g., Ahlfors (1979)):

$$\frac{dg}{dz} \triangleq \lim_{\Delta z \to 0} \frac{g(x_0 + \Delta x, y_0 + \Delta y) - g(x_0, y_0)}{\Delta x + j\Delta y}$$

where Δz = Δx + jΔy. For g(z) to be differentiable at $z_0$, in which case it is also said to be
analytic at $z_0$, the above limit should exist regardless of the direction from which z approaches $z_0$.
In particular, if we assume Δy = 0 and Δx → 0 so that Δz = Δx, then the above definition gives

$$\frac{dg}{dz} = \frac{\partial u}{\partial x} + j\frac{\partial v}{\partial x} \qquad \text{(C.2)}$$

If, on the other hand, we assume that Δx = 0 and Δy → 0 so that Δz = jΔy, then the definition
gives

$$\frac{dg}{dz} = \frac{\partial v}{\partial y} - j\frac{\partial u}{\partial y} \qquad \text{(C.3)}$$

The expressions (C.2) and (C.3) should coincide. Therefore, by adding them we get

$$\frac{dg}{dz} = \frac{1}{2} \left( \frac{\partial u}{\partial x} + j\frac{\partial v}{\partial x} + \frac{\partial v}{\partial y} - j\frac{\partial u}{\partial y} \right)$$

or, more compactly,

$$\frac{dg}{dz} = \frac{1}{2} \left( \frac{\partial g}{\partial x} - j\frac{\partial g}{\partial y} \right) \qquad \text{(C.4)}$$

Observe that the equality of expressions (C.2) and (C.3) implies that the real and imaginary parts of
g(·) should satisfy the conditions

$$\frac{\partial u}{\partial x} = \frac{\partial v}{\partial y} \quad \text{and} \quad \frac{\partial v}{\partial x} = -\frac{\partial u}{\partial y}$$

which are known as the Cauchy-Riemann conditions. It can be shown that these conditions are not
only necessary for a complex function g(z) to be differentiable at $z = z_0$, but if the partial derivatives
of u(·,·) and v(·,·) are continuous, then they are also sufficient.


C.2 SCALAR ARGUMENTS 


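More generally, if g is a function of both z and z*, we define its partial derivatives with respect to z
and z* as follows:

$$\frac{\partial g}{\partial z} \triangleq \frac{1}{2} \left( \frac{\partial g}{\partial x} - j\frac{\partial g}{\partial y} \right), \qquad \frac{\partial g}{\partial z^*} \triangleq \frac{1}{2} \left( \frac{\partial g}{\partial x} + j\frac{\partial g}{\partial y} \right) \qquad \text{(C.5)}$$

Note, in particular, from the Cauchy-Riemann conditions, that if g(z) is an analytic function of z
then it must necessarily hold that ∂g/∂z* = 0.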


Examples 
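We illustrate the definitions (C.4)-(C.5) by considering several examples.

1. Let g(z) = z = x + jy. Then

$$\frac{\partial g}{\partial z} = \frac{1}{2}(1 - j^2) = 1, \qquad \frac{\partial g}{\partial z^*} = \frac{1}{2}(1 + j^2) = 0$$

2. Let $g(z) = z^2 = (x + jy)(x + jy) = (x^2 - y^2) + j2xy$. Then

$$\frac{\partial g}{\partial z} = 2(x + jy) = 2z, \qquad \frac{\partial g}{\partial z^*} = 0$$

In Examples 1 and 2, since g is a function of z alone, it holds that ∂g/∂z = dg/dz.

3. Let $g(z) = |z|^2 = zz^* = (x + jy)(x - jy) = x^2 + y^2$. Then

$$\frac{\partial g}{\partial z} = x - jy = z^*, \qquad \frac{\partial g}{\partial z^*} = x + jy = z$$

4. Let $g(z) = \lambda + \alpha z + \beta z^* + \gamma zz^*$, where (λ, α, β, γ) are complex constants. That is,

$$g(z) = [\lambda + \alpha x + \beta x + \gamma(x^2 + y^2)] + j(\alpha y - \beta y)$$

Then

$$\frac{\partial g}{\partial z} = \alpha + \gamma z^*, \qquad \frac{\partial g}{\partial z^*} = \beta + \gamma z$$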
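The definitions of ∂g/∂z and ∂g/∂z* in terms of partial derivatives with respect to x and y can be checked numerically with central differences. The sketch below is an illustration only: the function name `wirtinger`, the step size h, and the test point are hypothetical choices, and the finite differences are approximations rather than exact derivatives.

```python
def wirtinger(g, z, h=1e-6):
    """Approximate (dg/dz, dg/dz*) at the point z.

    The partial derivatives of g with respect to x and y are estimated by
    central differences and then combined as (1/2)(g_x - j g_y) and
    (1/2)(g_x + j g_y).
    """
    dg_dx = (g(z + h) - g(z - h)) / (2 * h)            # perturb the real part
    dg_dy = (g(z + 1j * h) - g(z - 1j * h)) / (2 * h)  # perturb the imaginary part
    return 0.5 * (dg_dx - 1j * dg_dy), 0.5 * (dg_dx + 1j * dg_dy)

z0 = 1 + 2j  # hypothetical evaluation point

# Example 2: g(z) = z^2 is analytic, so the derivative with respect to z* vanishes.
d_sq, dc_sq = wirtinger(lambda z: z * z, z0)

# Example 3: g(z) = |z|^2 is not analytic; the two derivatives are z* and z.
d_abs, dc_abs = wirtinger(lambda z: z * z.conjugate(), z0)

print(d_sq, dc_sq, d_abs, dc_abs)
```

Up to rounding errors, the printed values should agree with 2z and 0 for the first function and with z* and z for the second.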
C.3 VECTOR ARGUMENTS 
Now assume that z is a column vector, say,

$$z = \mathrm{col}\{z_1, z_2, \ldots, z_n\}, \qquad z_i = x_i + jy_i$$

The complex gradient of g with respect to z is denoted by $\nabla_z g(z)$, or simply $\nabla_z g$, and is defined as
the row vector

$$\nabla_z g \triangleq \begin{bmatrix} \partial g/\partial z_1 & \partial g/\partial z_2 & \ldots & \partial g/\partial z_n \end{bmatrix} \qquad (z \text{ is a column, } \nabla_z g \text{ is a row})$$

Likewise, the complex gradient of g with respect to z* is defined as the column vector

$$\nabla_{z^*} g \triangleq \mathrm{col}\{\partial g/\partial z_1^*, \partial g/\partial z_2^*, \ldots, \partial g/\partial z_n^*\} \qquad (z^* \text{ is a row, } \nabla_{z^*} g \text{ is a column})$$

The reason why we choose to define $\nabla_z g$ as a row vector and $\nabla_{z^*} g$ as a column vector is because
the subsequent differentiation results will be consistent with what we are used to from the standard
differentiation of functions of real-valued arguments. Let us again consider a few examples:

1. Let g(z) = α*z, where {α, z} are column vectors. Then $\nabla_z g = \alpha^*$ and $\nabla_{z^*} g = 0$.

2. Let g(z) = z*β, where {β, z} are column vectors. Then $\nabla_z g = 0$ and $\nabla_{z^*} g = \beta$.

3. Let g(z) = z*z, where z is a column vector. Then $\nabla_z g = z^*$ and $\nabla_{z^*} g = z$.

4. Let g(z) = λ + α*z + z*β + z*Γz, where λ is a scalar, {α, β} are column vectors and Γ is
a matrix. Then $\nabla_z g = \alpha^* + z^*\Gamma$ and $\nabla_{z^*} g = \beta + \Gamma z$.


Hessian Matrix 
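The complex gradient of $\nabla_z g$ with respect to z* is called the Hessian matrix of g, and it is denoted
by

$$\nabla_z^2 g \triangleq \nabla_{z^*}[\nabla_z g] = \begin{bmatrix} \dfrac{\partial^2 g}{\partial z_1^* \partial z_1} & \dfrac{\partial^2 g}{\partial z_1^* \partial z_2} & \cdots & \dfrac{\partial^2 g}{\partial z_1^* \partial z_n} \\ \dfrac{\partial^2 g}{\partial z_2^* \partial z_1} & \dfrac{\partial^2 g}{\partial z_2^* \partial z_2} & \cdots & \dfrac{\partial^2 g}{\partial z_2^* \partial z_n} \\ \vdots & \vdots & & \vdots \\ \dfrac{\partial^2 g}{\partial z_n^* \partial z_1} & \dfrac{\partial^2 g}{\partial z_n^* \partial z_2} & \cdots & \dfrac{\partial^2 g}{\partial z_n^* \partial z_n} \end{bmatrix}$$

It holds that $\nabla_{z^*}[\nabla_z g] = \nabla_z[\nabla_{z^*} g]$. Consider again the example g(z) = λ + α*z + z*β + z*Γz,
then its Hessian matrix is $\nabla_z^2 g = \Gamma$.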


Real-Valued Arguments 
When z is real, say, z = x, and $g(x) = \lambda + \alpha^T x + x^T \beta + x^T \Gamma x$, with $x = \mathrm{col}\{x_1, x_2, \ldots, x_n\}$
and a symmetric matrix Γ, then the gradient of g with respect to x is again defined as

$$\nabla_x g \triangleq \begin{bmatrix} \partial g/\partial x_1 & \partial g/\partial x_2 & \ldots & \partial g/\partial x_n \end{bmatrix}$$

so that we now obtain $\nabla_x g = \alpha^T + \beta^T + 2x^T\Gamma$. Likewise, the Hessian matrix becomes $\nabla_x^2 g = 2\Gamma$.
Observe the difference from the complex case (in terms of an additional scaling factor that is equal
to 2). This is because, in the complex case, the symbols {z, z*} are treated as separate variables.
CHAPTER 1


Scalar-Valued Data 


In this first part of the book we focus on the basic, yet fundamental, problem of estimat-
ing an unobservable quantity from a collection of measurements in the least-mean-squares 
sense. The estimation task is made more or less difficult depending on how much infor- 
mation the measured data convey about the unobservable quantity. We shall study this 
estimation problem with increasing degrees of complexity, starting from a simple scenario 
and building up to more sophisticated cases. 

The material is developed initially at a slow pace. This is done deliberately in order to 
familiarize readers (and especially students) with the basic concepts of estimation theory 
for both real- and complex-valued random variables, as well as for scalar- and vector- 
valued random variables. We hope that, by the end of our exposition, the reader will be 
convinced that these different scenarios (of real vs. complex and scalar vs. vector) can be 
masked by adopting a uniform vector and complex-conjugation notation. The notation 
is introduced gradually in the two initial chapters and will be used throughout the book 
thereafter. 

Before plunging into a study of least-mean-squares estimation theory, and the reasons 
for its widespread use, the reader is advised to consult the review material in Secs. A.1-A.4. 
These sections provide an intuitive explanation for what the variance of a random variable 
means. The sections also introduce several useful concepts such as complex- and vector- 
valued random variables and the notions of independence and uncorrelatedness between 
two random variables. The explanations will help the reader appreciate the value of the 
least-mean-squares criterion, which is used extensively in later sections and chapters. 


1.1 ESTIMATION WITHOUT OBSERVATIONS 


We initiate our discussions of estimation theory by posing and solving a simple (almost
trivial) estimation problem. Thus, suppose that all we know about a real-valued random
variable x is its mean $\bar{x}$ and its variance $\sigma_x^2$, and that we wish to estimate the value that
x will assume in a given experiment. We shall denote the estimate of x by $\hat{x}$; it is a
deterministic quantity (i.e., a number). But how do we come up with a value for $\hat{x}$? And
how do we decide whether this value is optimal or not? And if optimal, in what sense?
These inquiries are at the heart of every estimation problem.

To answer these questions, we first need to choose a cost function to penalize the estimation
error. The resulting estimate $\hat{x}$ will be optimal only in the sense that it leads to the
smallest cost value. Different choices for the cost function will generally lead to different
choices for $\hat{x}$, each of which will be optimal in its own way.

The design criterion we shall adopt is the mean-square-error criterion. It is based on 
introducing the error signal 

$$\tilde{\mathbf{x}} \triangleq \mathbf{x} - \hat{x}$$

and then determining $\hat{x}$ by minimizing the mean-square error (m.s.e.), which is defined as 
the expected value of $\tilde{\mathbf{x}}^2$, i.e., 

$$\min_{\hat{x}} \; E\,\tilde{\mathbf{x}}^2 \tag{1.1}$$

The error $\tilde{\mathbf{x}}$ is a random variable since $\mathbf{x}$ is random. The resulting estimate, $\hat{x}$, will be 
called the least-mean-squares estimate of $\mathbf{x}$. The following result is immediate (and, in 
fact, intuitively obvious as we explain below). 


Lemma 1.1 (Lack of observations) The least-mean-squares estimate of $\mathbf{x}$ given 
knowledge of $(\bar{x}, \sigma_x^2)$ is $\hat{x} = \bar{x}$. The resulting minimum cost is $E\,\tilde{\mathbf{x}}^2 = \sigma_x^2$. 


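Proof: Expand the mean-square error by subtracting and adding $\bar{x}$ as follows: 

$$E\,\tilde{\mathbf{x}}^2 = E(\mathbf{x} - \hat{x})^2 = E\left[(\mathbf{x} - \bar{x}) + (\bar{x} - \hat{x})\right]^2 = \sigma_x^2 + (\bar{x} - \hat{x})^2$$

The cross term vanishes because $E(\mathbf{x} - \bar{x}) = 0$. The choice of $\hat{x}$ that minimizes the m.s.e. is now 
evident. Only the term $(\bar{x} - \hat{x})^2$ is dependent on $\hat{x}$ and this term can be annihilated by choosing 
$\hat{x} = \bar{x}$. The resulting minimum mean-square error (m.m.s.e.) is then 

$$\text{m.m.s.e.} \triangleq E\,\tilde{\mathbf{x}}^2 = \sigma_x^2$$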


An alternative derivation would be to expand the cost function as 

$$E(\mathbf{x} - \hat{x})^2 = E\,\mathbf{x}^2 - 2\bar{x}\hat{x} + \hat{x}^2$$

and to differentiate it with respect to $\hat{x}$. By setting the derivative equal to zero we arrive at the same 
conclusion, namely, $\hat{x} = \bar{x}$. 
◇
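The statement of Lemma 1.1 can be checked numerically. The following is a minimal sketch, not part of the original text: the exponential distribution, the sample size, and the grid of candidate constants are all arbitrary assumptions chosen for illustration. It scans constant guesses and compares the best one, and its cost, with the sample mean and sample variance.

```python
import numpy as np

rng = np.random.default_rng(0)
# Assumed test distribution (any finite-variance choice would do):
# exponential with mean 2 and variance 4.
x = rng.exponential(scale=2.0, size=200_000)

def mse(x_hat):
    """Sample estimate of E (x - x_hat)^2 for a constant guess x_hat."""
    return np.mean((x - x_hat) ** 2)

# Scan candidate constants on a grid with spacing 0.05.
candidates = np.linspace(0.0, 4.0, 81)
costs = np.array([mse(c) for c in candidates])
best = candidates[np.argmin(costs)]

print("sample mean     :", x.mean())
print("best constant   :", best)
print("minimum cost    :", costs.min())
print("sample variance :", x.var())
```

Up to the grid resolution, the best constant coincides with the sample mean and the minimum cost with the sample variance, as the lemma predicts.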


There are several good reasons for choosing the mean-square-error criterion (1.1). The 
simplest one perhaps is that the criterion is amenable to mathematical manipulations, more 
so than any other criterion. In addition, the criterion is essentially attempting to force the 
estimation error to assume values close to its mean, which happens to be zero. This is 
because 

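$$E\,\tilde{\mathbf{x}} = E(\mathbf{x} - \hat{x}) = E(\mathbf{x} - \bar{x}) = \bar{x} - \bar{x} = 0$$


and, by minimizing $E\,\tilde{\mathbf{x}}^2$, we are in effect minimizing the variance of the error, $\tilde{\mathbf{x}}$. In 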
view of the discussion in Sec. A.1 regarding the interpretation of the variance of a random 
variable, we find that the mean-square-error criterion is therefore attempting to increase 
the likelihood of small errors. 

The effectiveness of the estimation procedure (1.1) can be measured by examining the 
value of the minimum cost, which is the variance of the resulting estimation error. The 
above lemma tells us that the minimum cost is equal to $\sigma_x^2$. That is, 

$$\sigma_{\tilde{x}}^2 = \sigma_x^2$$


so that the estimate $\hat{x} = \bar{x}$ does not reduce our initial uncertainty about $\mathbf{x}$ since the error 
variable still has the same variance as $\mathbf{x}$ itself! We thus find that the performance of the 
mean-square-error design procedure is limited in this case. Clearly, we are more interested 
in estimation procedures that result in error variances that are smaller than the original 
signal variance. We shall discuss one such procedure in the next section. 

The reason for the poor performance of the estimate $\hat{x} = \bar{x}$ lies in the lack of more 
sophisticated prior information about $\mathbf{x}$. Note that Lemma 1.1 simply tells us that the best 
we can do, in the absence of any other information about a random variable $\mathbf{x}$, other than 
its mean and variance, is to use the mean value of $\mathbf{x}$ as our estimate. This statement is, 
in a sense, intuitive. After all, the mean value of a random variable is, by definition, an 
indication of the value that we would expect to occur on average in repeated experiments. 
Hence, in answer to the question: what is the best guess for $\mathbf{x}$?, the analysis tells us that 
the best guess is what we would expect for x on average! This is a circular answer, but one 
that is at least consistent with intuition. 


Example 1.1 (Binary signal) 


Assume $\mathbf{x}$ represents a BPSK (binary phase-shift keying) signal that is equal to $\pm 1$ with probability 
1/2 each. Then 

$$\bar{x} = E\,\mathbf{x} = \tfrac{1}{2}\cdot(+1) + \tfrac{1}{2}\cdot(-1) = 0$$

and 

$$\sigma_x^2 = E\,\mathbf{x}^2 = 1$$

Now given knowledge of $(\bar{x}, \sigma_x^2)$ alone, the best estimate of $\mathbf{x}$ in the least-mean-squares sense is 
$\hat{x} = \bar{x} = 0$. This example shows that the least-mean-squares (and, hence, optimal) estimate does not 
always lead to a meaningful solution! In this case, $\hat{x} = 0$ is not useful in guessing whether $\mathbf{x}$ is $+1$ 
or $-1$ in a given realization. If we could incorporate into the design of the estimator the knowledge 
that $\mathbf{x}$ is a BPSK signal, or some other related information, then we could perhaps come up with a 
better estimate for $\mathbf{x}$. 

◇


1.2 ESTIMATION GIVEN DEPENDENT OBSERVATIONS 


So let us examine the case in which more is known about a random variable $\mathbf{x}$, other than 
its mean and variance. Specifically, let us assume that we have access to an observation of 
a second random variable $\mathbf{y}$ that is related to $\mathbf{x}$ in some way. For example, $\mathbf{y}$ could be a 
noisy measurement of $\mathbf{x}$, say, $\mathbf{y} = \mathbf{x} + \mathbf{v}$, where $\mathbf{v}$ denotes the disturbance, or $\mathbf{y}$ could be 
the sign of $\mathbf{x}$, or dependent on $\mathbf{x}$ in some other manner. 

Given two dependent random variables $\{\mathbf{x}, \mathbf{y}\}$, we therefore pose the problem of 
determining the least-mean-squares estimator of $\mathbf{x}$ given $\mathbf{y}$. Observe that we are now employing 
the terminology estimator of $\mathbf{x}$ as opposed to estimate of $\mathbf{x}$. In order to highlight this 
distinction, we denote the estimator of $\mathbf{x}$ by the boldface notation $\hat{\mathbf{x}}$; it is a random variable 
that is defined as a function of $\mathbf{y}$, say, 

$$\hat{\mathbf{x}} = h(\mathbf{y})$$

for some function $h(\cdot)$ to be determined. Once the function $h(\cdot)$ has been determined, 
evaluating it at a particular occurrence of $\mathbf{y}$, say, for $\mathbf{y} = y$, will result in an estimate for 
$\mathbf{x}$, i.e., 

$$\hat{x} = h(y)$$

Different occurrences for $\mathbf{y}$ lead to different estimates $\hat{x}$. In Sec. 1.1 we did not need to 
make this distinction between an estimator $\hat{\mathbf{x}}$ and an estimate $\hat{x}$. There we sought directly 
an estimate $\hat{x}$ for $\mathbf{x}$ since we did not have access to a random variable $\mathbf{y}$; we only had 
access to the deterministic quantities $(\bar{x}, \sigma_x^2)$. 

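The criterion we shall use to determine the estimator $\hat{\mathbf{x}}$ is still the mean-square-error 
criterion. We define the error signal 

$$\tilde{\mathbf{x}} \triangleq \mathbf{x} - \hat{\mathbf{x}} \tag{1.2}$$

and then determine $\hat{\mathbf{x}}$ by minimizing the mean-square error over all possible functions 
$h(\cdot)$: 

$$\min_{h(\cdot)} \; E\,\tilde{\mathbf{x}}^2 \tag{1.3}$$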


The solution is given by the following statement. 


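Theorem 1.1 (Optimal mean-square-error estimator) The least-mean-squares 
estimator (l.m.s.e.) of $\mathbf{x}$ given $\mathbf{y}$ is the conditional expectation of $\mathbf{x}$ given $\mathbf{y}$, 
i.e., $\hat{\mathbf{x}} = E(\mathbf{x}|\mathbf{y})$. The resulting estimate is 

$$\hat{x} = E(\mathbf{x}|\mathbf{y} = y) = \int_{S_x} x\, f_{\mathbf{x}|\mathbf{y}}(x|y)\, dx$$

where $S_x$ denotes the support (or domain) of the random variable $\mathbf{x}$. 
Moreover, the estimator is unbiased, i.e., $E\,\hat{\mathbf{x}} = \bar{x}$, and the resulting minimum cost 
is $E\,\tilde{\mathbf{x}}^2 = \sigma_x^2 - \sigma_{\hat{x}}^2$. 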


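Proof: There are several ways to establish the result. Our argument is based on recalling that for 
any two random variables $\mathbf{x}$ and $\mathbf{y}$, it holds that (see Prob. I.4): 

$$E\,\mathbf{x} = E\left[E(\mathbf{x}|\mathbf{y})\right] \tag{1.4}$$

where the outermost expectation on the right-hand side is with respect to $\mathbf{y}$, while the innermost 
expectation is with respect to $\mathbf{x}$. We shall indicate these facts explicitly by showing the variables 
with respect to which the expectations are performed, so that (1.4) is rewritten as 

$$E\,\mathbf{x} = E_{\mathbf{y}}\left[E_{\mathbf{x}}(\mathbf{x}|\mathbf{y})\right]$$

It now follows that, for any function of $\mathbf{y}$, say, $g(\mathbf{y})$, it holds that 

$$E_{\mathbf{x},\mathbf{y}}\, \mathbf{x}\, g(\mathbf{y}) = E_{\mathbf{y}}\left[E_{\mathbf{x}}(\mathbf{x}\, g(\mathbf{y})|\mathbf{y})\right] = E_{\mathbf{y}}\left[E_{\mathbf{x}}(\mathbf{x}|\mathbf{y})\, g(\mathbf{y})\right] = E_{\mathbf{x},\mathbf{y}}\left[E_{\mathbf{x}}(\mathbf{x}|\mathbf{y})\right] g(\mathbf{y})$$

This means that, for any $g(\mathbf{y})$, it holds that $E_{\mathbf{x},\mathbf{y}}\left[\mathbf{x} - E_{\mathbf{x}}(\mathbf{x}|\mathbf{y})\right] g(\mathbf{y}) = 0$, which we write more 
compactly as 

$$E\left[\mathbf{x} - E(\mathbf{x}|\mathbf{y})\right] g(\mathbf{y}) = 0 \tag{1.5}$$

Expression (1.5) states that the random variable $\mathbf{x} - E(\mathbf{x}|\mathbf{y})$ is uncorrelated with any function $g(\cdot)$ 
of $\mathbf{y}$. Indeed, as mentioned before in Sec. A.2, two random variables $\mathbf{a}$ and $\mathbf{b}$ are uncorrelated if, and 
only if, their cross-correlation is zero, i.e., $E(\mathbf{a} - \bar{a})(\mathbf{b} - \bar{b}) = 0$. On the other hand, the random 
variables are said to be orthogonal if, and only if, $E\,\mathbf{a}\mathbf{b} = 0$. It is easy to verify that the concepts 
of orthogonality and uncorrelatedness coincide if at least one of the random variables is zero mean. 
From equation (1.5) we conclude that the variables $\mathbf{x} - E(\mathbf{x}|\mathbf{y})$ and $g(\mathbf{y})$ are orthogonal. However, 
since $\mathbf{x} - E(\mathbf{x}|\mathbf{y})$ is zero mean, then we can also say that they are uncorrelated. 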


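Using this intermediate result, we return to the cost function (1.3), add and subtract $E(\mathbf{x}|\mathbf{y})$ to its 
argument, and express it as 

$$E(\mathbf{x} - \hat{\mathbf{x}})^2 = E\left[\mathbf{x} - E(\mathbf{x}|\mathbf{y}) + E(\mathbf{x}|\mathbf{y}) - \hat{\mathbf{x}}\right]^2$$

The term $E(\mathbf{x}|\mathbf{y}) - \hat{\mathbf{x}}$ is a function of $\mathbf{y}$. Therefore, if we choose $g(\mathbf{y}) = E(\mathbf{x}|\mathbf{y}) - \hat{\mathbf{x}}$, then from 
the orthogonality property (1.5) we conclude that 

$$E(\mathbf{x} - \hat{\mathbf{x}})^2 = E\left[\mathbf{x} - E(\mathbf{x}|\mathbf{y})\right]^2 + E\left[E(\mathbf{x}|\mathbf{y}) - \hat{\mathbf{x}}\right]^2$$

Now only the second term on the right-hand side is dependent on $\hat{\mathbf{x}}$ and the m.s.e. is minimized by 
choosing $\hat{\mathbf{x}} = E(\mathbf{x}|\mathbf{y})$. To evaluate the resulting m.m.s.e. we first note that the optimal estimator is 
unbiased since 

$$E\,\hat{\mathbf{x}} = E\left[E(\mathbf{x}|\mathbf{y})\right] = E\,\mathbf{x} = \bar{x}$$

and its variance is therefore given by $\sigma_{\hat{x}}^2 = E\,\hat{\mathbf{x}}^2 - \bar{x}^2$. Moreover, in view of the orthogonality 
property (1.5), and in view of the fact that $\hat{\mathbf{x}} = E(\mathbf{x}|\mathbf{y})$ is itself a function of $\mathbf{y}$, we have 

$$E(\mathbf{x} - \hat{\mathbf{x}})\,\hat{\mathbf{x}} = 0 \tag{1.6}$$

In other words, the estimation error, $\tilde{\mathbf{x}}$, is uncorrelated with the optimal estimator. Using this result, 
we can evaluate the m.m.s.e. as follows: 

$$\begin{aligned}
E\,\tilde{\mathbf{x}}^2 &= E[\mathbf{x} - \hat{\mathbf{x}}][\mathbf{x} - \hat{\mathbf{x}}] = E[\mathbf{x} - \hat{\mathbf{x}}]\,\mathbf{x} \quad \text{(because of (1.6))} \\
&= E\,\mathbf{x}^2 - E\,\hat{\mathbf{x}}[\hat{\mathbf{x}} + \tilde{\mathbf{x}}] \\
&= E\,\mathbf{x}^2 - E\,\hat{\mathbf{x}}^2 \quad \text{(because of (1.6))} \\
&= (E\,\mathbf{x}^2 - \bar{x}^2) + (\bar{x}^2 - E\,\hat{\mathbf{x}}^2) = \sigma_x^2 - \sigma_{\hat{x}}^2
\end{aligned}$$

◇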


Theorem 1.1 tells us that the least-mean-squares estimator of $\mathbf{x}$ is its conditional 
expectation given $\mathbf{y}$. This result is again intuitive. In answer to the question: what is the best 
guess for x given that we observed y?, the analysis tells us that the best guess is what we 
would expect for x given the occurrence of y! 


Example 1.2 (Noisy measurement of a binary signal) 


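Let us return to Ex. 1.1, where $\mathbf{x}$ is a BPSK signal that assumes the values $\pm 1$ with probability 
1/2 each. Assume now that in addition to the mean and variance of $\mathbf{x}$, we also have access to a noisy 
observation of $\mathbf{x}$, say, 

$$\mathbf{y} = \mathbf{x} + \mathbf{v}$$

Assume further that the signal $\mathbf{x}$ and the disturbance $\mathbf{v}$ are independent, with $\mathbf{v}$ being a zero-mean 
Gaussian random variable of unit variance, i.e., its pdf is given by 

$$f_{\mathbf{v}}(v) = \frac{1}{\sqrt{2\pi}}\, e^{-v^2/2}$$

Our intuition tells us that we should be able to do better here than in Ex. 1.1. But beware, even here, 
we shall make some interesting observations. 

According to Thm. 1.1, the optimal estimate of $\mathbf{x}$ given an observation of $\mathbf{y}$ is 

$$\hat{x} = E(\mathbf{x}|\mathbf{y} = y) = \int_{-\infty}^{\infty} x\, f_{\mathbf{x}|\mathbf{y}}(x|y)\, dx \tag{1.7}$$

We therefore need to determine the conditional pdf, $f_{\mathbf{x}|\mathbf{y}}(x|y)$, and evaluate the integral (1.7). For 
this purpose, we start by noting, from probability theory, that the pdf of the sum of two independent 
random variables, namely, $\mathbf{y} = \mathbf{x} + \mathbf{v}$, is equal to the convolution of their individual pdfs, i.e., 

$$f_{\mathbf{y}}(y) = \int_{-\infty}^{\infty} f_{\mathbf{x}}(x)\, f_{\mathbf{v}}(y - x)\, dx$$

In this example, we have 

$$f_{\mathbf{x}}(x) = \frac{1}{2}\delta(x - 1) + \frac{1}{2}\delta(x + 1)$$

where $\delta(\cdot)$ is the Dirac-delta function, so that $f_{\mathbf{y}}(y)$ is given by 

$$f_{\mathbf{y}}(y) = \frac{1}{2} f_{\mathbf{v}}(y + 1) + \frac{1}{2} f_{\mathbf{v}}(y - 1) \tag{1.8}$$

Moreover, the joint pdf of $\{\mathbf{x}, \mathbf{y}\}$ is given by 

$$\begin{aligned}
f_{\mathbf{x},\mathbf{y}}(x, y) &= f_{\mathbf{x}}(x)\, f_{\mathbf{y}|\mathbf{x}}(y|x) \\
&= \left[\frac{1}{2}\delta(x - 1) + \frac{1}{2}\delta(x + 1)\right] f_{\mathbf{v}}(y - x) \\
&= \frac{1}{2} f_{\mathbf{v}}(y - 1)\,\delta(x - 1) + \frac{1}{2} f_{\mathbf{v}}(y + 1)\,\delta(x + 1)
\end{aligned}$$

Using (A.6) we get 

$$f_{\mathbf{x}|\mathbf{y}}(x|y) = \frac{f_{\mathbf{x},\mathbf{y}}(x, y)}{f_{\mathbf{y}}(y)} = \frac{f_{\mathbf{v}}(y - 1)\,\delta(x - 1)}{f_{\mathbf{v}}(y + 1) + f_{\mathbf{v}}(y - 1)} + \frac{f_{\mathbf{v}}(y + 1)\,\delta(x + 1)}{f_{\mathbf{v}}(y + 1) + f_{\mathbf{v}}(y - 1)}$$

Substituting into expression (1.7) for $\hat{x}$ and integrating we obtain 

$$\begin{aligned}
\hat{x} &= \frac{f_{\mathbf{v}}(y - 1)}{f_{\mathbf{v}}(y + 1) + f_{\mathbf{v}}(y - 1)} - \frac{f_{\mathbf{v}}(y + 1)}{f_{\mathbf{v}}(y + 1) + f_{\mathbf{v}}(y - 1)} \\
&= \frac{1}{e^{-2y} + 1} - \frac{1}{e^{2y} + 1} \\
&= \frac{e^{y} - e^{-y}}{e^{y} + e^{-y}} \;\triangleq\; \tanh(y)
\end{aligned}$$

In other words, the least-mean-squares estimator of $\mathbf{x}$ is the hyperbolic tangent function, 

$$\hat{\mathbf{x}} = \tanh(\mathbf{y}) \tag{1.9}$$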


The result is represented schematically in Fig. 1.1. 

FIGURE 1.1 Optimal estimation of a BPSK signal embedded in unit-variance additive Gaussian 
noise. 

Figure 1.2 plots the function $\tanh(y)$. We see that it tends to $\pm 1$ as $y \to \pm\infty$. For other 
values of $y$, the function assumes real values that are distinct from $\pm 1$. This is a bit puzzling from 
the designer's perspective. The designer is interested in knowing whether the symbol $\mathbf{x}$ is $+1$ or 
$-1$ based on the observed value of $\mathbf{y}$. The above construction tells the designer to estimate $\mathbf{x}$ by 
computing $\tanh(y)$. But this value will never be exactly $+1$ or $-1$; it will be a real number inside 
the interval $(-1, 1)$. The designer will then be induced to make a hard decision of the form: 

$$\text{decide in favor of } \begin{cases} +1 & \text{if } \hat{x} \text{ is nonnegative} \\ -1 & \text{if } \hat{x} \text{ is negative} \end{cases}$$

FIGURE 1.2 A plot of the function $\tanh(y)$ for $y$ between $-3$ and $3$ (panel title: Optimal decision 
device for BPSK data buried in Gaussian noise). 

In effect, the designer ends up implementing the alternative estimator: 

$$\hat{\mathbf{x}} = \mathrm{sign}[\tanh(\mathbf{y})] \tag{1.10}$$

where $\mathrm{sign}(\cdot)$ denotes the sign of its argument; it is equal to $+1$ if the argument is nonnegative and 
$-1$ otherwise. 

We therefore have a situation where the optimal estimator, although known in closed form, does 
not solve the original problem of recovering the symbols $\pm 1$ directly. Instead, the designer is forced 
to implement a suboptimal solution; it is suboptimal from a least-mean-squares point of view. Even 
more puzzling, the designer could consider implementing the alternative (and simpler) suboptimal 
estimator: 

$$\hat{\mathbf{x}} = \mathrm{sign}(\mathbf{y}) \tag{1.11}$$

where the $\mathrm{sign}(\cdot)$ function operates directly on $\mathbf{y}$ rather than on $\tanh(\mathbf{y})$ (see Fig. 1.3). Both 
suboptimal implementations (1.10) and (1.11) lead to the same result since, as is evident from Fig. 1.2, 
$\mathrm{sign}[\tanh(y)] = \mathrm{sign}(y)$. In the computer project at the end of this part we shall compare the 
performance of the optimal and suboptimal estimators (1.9)-(1.11). 




FIGURE 1.3 Suboptimal estimation of a BPSK signal embedded in unit-variance additive 
Gaussian noise. 


We may mention that in the digital communications literature, especially in studies on 
equalization methods, an implementation using (1.11) is usually said to be based on hard decisions, while an 
implementation using (1.9) is said to be based on soft decisions. 

◇
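The estimators (1.9) and (1.11) of Ex. 1.2 can be compared by simulation. The sketch below is only illustrative and is not the computer project mentioned in the text; the sample size and random seed are arbitrary assumptions. It estimates the mean-square error of the constant estimate of Ex. 1.1, of the soft-decision estimator tanh(y), and of the hard-decision estimator sign(y), and checks the identity m.m.s.e. = var(x) - var(x_hat) of Thm. 1.1 for the optimal estimator.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400_000                                  # assumed number of realizations
x = rng.choice([-1.0, 1.0], size=n)          # BPSK symbols, equally likely
v = rng.standard_normal(n)                   # zero-mean, unit-variance Gaussian noise
y = x + v

estimators = {
    "mean (x_hat = 0)": np.zeros(n),         # Ex. 1.1: no observation used
    "tanh(y)": np.tanh(y),                   # optimal estimator (1.9)
    "sign(y)": np.where(y >= 0, 1.0, -1.0),  # hard decisions (1.11)
}
mse = {name: np.mean((x - est) ** 2) for name, est in estimators.items()}
for name, value in mse.items():
    print(f"{name:18s} m.s.e. = {value:.4f}")

# Thm. 1.1: for the optimal estimator, m.m.s.e. = var(x) - var(x_hat).
gap = mse["tanh(y)"] - (x.var() - np.tanh(y).var())
print("m.m.s.e. minus (var(x) - var(x_hat)):", gap)
```

With enough samples, tanh(y) attains the smallest m.s.e., while sign(y) sits near 4 P(v >= 1), roughly 0.635, since each wrong hard decision costs a squared error of 4.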


Remark 1.1 (Complexity of optimal estimation) Example 1.2 highlights one of the 
inconveniences of working with the optimal estimator of Thm. 1.1. Although the form of the optimal solution 
is given explicitly by $\hat{\mathbf{x}} = E(\mathbf{x}|\mathbf{y})$, in general it is not an easy task to find a closed-form expression 
for the conditional expectation of one random variable given another (especially for other choices of 
probability density functions). Moreover, even when a closed-form expression can be found, one is usually led 
to a nonlinear estimator whose implementation may not be practical or may even be costly. For this 
reason, from Part II (Linear Estimation) onwards, we shall restrict the class of estimators to linear 
estimators, and study the capabilities of these estimators. 


◇


The purpose of Exs. 1.1 and 1.2 is not to confuse the reader, but rather to stress the fact 
that an optimal estimator is optimal only in the sense that it satisfies a certain optimality 
criterion. One should not confuse an optimal guess with a perfect guess. One should also 
not confuse an optimal guess with a practical one; an optimal guess need not be perfect or 
even practical, though it can suggest good practical solutions. 


1.3 ORTHOGONALITY PRINCIPLE 


There are two important conclusions that follow from the proof of Thm. 1.1, namely, the 
orthogonality properties (1.5) and (1.6). The first one states that the difference 


$$\mathbf{x} - E(\mathbf{x}|\mathbf{y})$$

is orthogonal to any function $g(\cdot)$ of $\mathbf{y}$. Now since we already know that the conditional 
expectation, $E(\mathbf{x}|\mathbf{y})$, is the optimal least-mean-squares estimator of $\mathbf{x}$, we can restate this 
result by saying that the estimation error $\tilde{\mathbf{x}}$ is orthogonal to any function of $\mathbf{y}$, 

$$E\,\tilde{\mathbf{x}}\, g(\mathbf{y}) = 0 \tag{1.12}$$

We shall sometimes use a geometric notation to refer to this result and write instead 

$$\tilde{\mathbf{x}} \perp g(\mathbf{y}) \tag{1.13}$$

where the symbol $\perp$ is used to signify that the two random variables are orthogonal; a 
schematic representation of this orthogonality property is shown in Fig. 1.4. 

Relation (1.13) admits the following interpretation. It states that the optimal estimator $\hat{\mathbf{x}} = 
E(\mathbf{x}|\mathbf{y})$ is such that the resulting error, $\tilde{\mathbf{x}}$, is orthogonal to (and, in fact, also uncorrelated 
with) any transformation of the data $\mathbf{y}$. In other words, the optimal estimator is such that 
no matter how we modify the data $\mathbf{y}$, there is no way we can extract additional information 
from the data in order to reduce the variance of $\tilde{\mathbf{x}}$ any further. This is because any additional 
processing of $\mathbf{y}$ will remain uncorrelated with $\tilde{\mathbf{x}}$. 

The second orthogonality property (1.6) is a special case of (1.13). It states that 

$$\tilde{\mathbf{x}} \perp \hat{\mathbf{x}}$$

That is, the estimation error is orthogonal to (or uncorrelated with) the estimator itself. This 
is a special case of (1.13) since $\hat{\mathbf{x}}$ is a function of $\mathbf{y}$ by virtue of the result $\hat{\mathbf{x}} = E(\mathbf{x}|\mathbf{y})$. 

FIGURE 1.4 The orthogonality condition: $\tilde{\mathbf{x}} \perp g(\mathbf{y})$. 

In summary, the optimal least-mean-squares estimator is such that the estimation error 
is orthogonal to the estimator and, more generally, to any function of the observation. It 
turns out that the converse statement is also true so that the orthogonality condition (1.13) 
is in fact a defining property of optimality in the least-mean-squares sense. 


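Theorem 1.2 (Orthogonality condition) Given two random variables $\mathbf{x}$ and 
$\mathbf{y}$, an estimator $\hat{\mathbf{x}} = h(\mathbf{y})$ is optimal in the least-mean-squares sense (1.3) if, 
and only if, $\hat{\mathbf{x}}$ is unbiased (i.e., $E\,\hat{\mathbf{x}} = \bar{x}$) and $\mathbf{x} - \hat{\mathbf{x}} \perp g(\mathbf{y})$ for any function 
$g(\cdot)$. 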


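Proof: One direction has already been proven prior to the statement of the theorem, namely, if $\hat{\mathbf{x}}$ 
is the optimal estimator and hence, $\hat{\mathbf{x}} = E(\mathbf{x}|\mathbf{y})$, then we already know from (1.13) that $\tilde{\mathbf{x}} \perp g(\mathbf{y})$, 
for any $g(\cdot)$. Moreover, we know from Thm. 1.1 that this estimator is unbiased. 

Conversely, assume $\hat{\mathbf{x}}$ is some unbiased estimator for $\mathbf{x}$ and that it satisfies $\mathbf{x} - \hat{\mathbf{x}} \perp g(\mathbf{y})$, for 
any $g(\cdot)$. Define the random variable $\mathbf{z} = \hat{\mathbf{x}} - E(\mathbf{x}|\mathbf{y})$ and let us show that it is the zero variable 
with probability one. For this purpose, we note first that $\mathbf{z}$ is zero mean since 

$$E\,\mathbf{z} = E\,\hat{\mathbf{x}} - E\left(E(\mathbf{x}|\mathbf{y})\right) = \bar{x} - \bar{x} = 0$$

Moreover, from (1.5) we have $\mathbf{x} - E(\mathbf{x}|\mathbf{y}) \perp g(\mathbf{y})$ and, by assumption, we have $\mathbf{x} - \hat{\mathbf{x}} \perp g(\mathbf{y})$ 
for any $g(\cdot)$. Subtracting these two conditions we conclude that $\mathbf{z} \perp g(\mathbf{y})$, which is the same as 
$E\,\mathbf{z}\, g(\mathbf{y}) = 0$. Now since the variable $\mathbf{z}$ itself is a function of $\mathbf{y}$, we may choose $g(\mathbf{y}) = \mathbf{z}$ to get 
$E\,\mathbf{z}^2 = 0$. We thus find that $\mathbf{z}$ is zero mean and has zero variance, so that, from Remark A.1, we 
conclude that $\mathbf{z} = 0$, or equivalently, $\hat{\mathbf{x}} = E(\mathbf{x}|\mathbf{y})$, with probability one. ◇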


Example 1.3 (Suboptimal estimator for a binary signal) 


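Consider again Ex. 1.2, where $\mathbf{x}$ is a BPSK signal that assumes the values $\pm 1$ with probability 
1/2 each. Let us verify that the estimator $\hat{\mathbf{x}} = \mathrm{sign}(\mathbf{y})$ is not optimal in the least-mean-squares sense. 
We already know that this is the case because we found in Ex. 1.2 that the optimal estimator is 
$\tanh(\mathbf{y})$. Here we wish to verify the suboptimality of $\mathrm{sign}(\mathbf{y})$ without assuming prior knowledge 
of the optimal estimator, and by relying solely on the orthogonality condition (1.12). 

According to Thm. 1.2, we need to verify that the estimator $\mathrm{sign}(\mathbf{y})$ fails the orthogonality test. 
In particular, we shall exhibit a function $g(\mathbf{y})$ such that the difference $\mathbf{x} - \mathrm{sign}(\mathbf{y})$ is correlated with 
it. Actually, we shall choose $g(\mathbf{y}) = \mathrm{sign}(\mathbf{y})$ and verify that 

$$E\,[\mathbf{x} - \mathrm{sign}(\mathbf{y})]\,\mathrm{sign}(\mathbf{y}) \neq 0 \tag{1.14}$$

Let us first check whether the estimator $\hat{\mathbf{x}} = \mathrm{sign}(\mathbf{y})$ is biased or not. For this purpose we recall 
that $\mathbf{y} = \mathbf{x} + \mathbf{v}$ and that 

$$\mathrm{sign}(\mathbf{x} + \mathbf{v}) = \begin{cases} +1 & \text{if } \mathbf{x} + \mathbf{v} \geq 0 \\ -1 & \text{if } \mathbf{x} + \mathbf{v} < 0 \end{cases}$$

We therefore need to evaluate the probability of the events $\mathbf{x} + \mathbf{v} \geq 0$ and $\mathbf{x} + \mathbf{v} < 0$. For the first 
case we have 

$$\mathbf{x} + \mathbf{v} \geq 0 \iff (\mathbf{x} = +1 \text{ and } \mathbf{v} \geq -1) \text{ or } (\mathbf{x} = -1 \text{ and } \mathbf{v} \geq 1)$$

Now recall that $\mathbf{x}$ and $\mathbf{v}$ are independent and that $\mathbf{v}$ is a zero-mean unit-variance Gaussian random 
variable. Thus, let 

$$\alpha \triangleq P(\mathbf{v} \geq 1) = \frac{1}{\sqrt{2\pi}} \int_{1}^{\infty} e^{-v^2/2}\, dv \tag{1.15}$$

Then 

$$P(\mathbf{v} \geq -1) = 1 - P(\mathbf{v} < -1) = 1 - P(\mathbf{v} > 1) = 1 - \alpha$$

and we obtain 

$$P(\mathbf{x} + \mathbf{v} \geq 0) = (1 - \alpha)/2 + \alpha/2 = 1/2$$

Consequently, $P(\mathbf{x} + \mathbf{v} < 0) = 1/2$ and $E\,\mathrm{sign}(\mathbf{x} + \mathbf{v}) = 0$. This means that the estimator 
$\hat{\mathbf{x}} = \mathrm{sign}(\mathbf{y})$ is unbiased. We now return to (1.14) and note that, since $\mathrm{sign}^2(\mathbf{y}) = 1$, 

$$E\,[\mathbf{x} - \mathrm{sign}(\mathbf{y})]\,\mathrm{sign}(\mathbf{y}) = E\,\mathbf{x}\,\mathrm{sign}(\mathbf{y}) - 1$$

Therefore, all we need to do in order to verify that (1.14) holds is to check that $E\,\mathbf{x}\,\mathrm{sign}(\mathbf{y})$ does not 
evaluate to one. To do this, we introduce the random variable $\mathbf{z} = \mathbf{x}\,\mathrm{sign}(\mathbf{y})$ and proceed to evaluate 
its mean. It is clear from the definition of $\mathbf{z}$ that 

$$\mathbf{z} = \begin{cases} +1 & \text{if } (\mathbf{x} = +1 \text{ and } \mathbf{v} \geq -1) \text{ or } (\mathbf{x} = -1 \text{ and } \mathbf{v} < 1) \\ -1 & \text{if } (\mathbf{x} = +1 \text{ and } \mathbf{v} < -1) \text{ or } (\mathbf{x} = -1 \text{ and } \mathbf{v} \geq 1) \end{cases}$$

The events 

$$(\mathbf{x} = +1 \text{ and } \mathbf{v} \geq -1) \quad \text{or} \quad (\mathbf{x} = -1 \text{ and } \mathbf{v} < 1)$$

each has probability $0.5(1 - \alpha)$. Likewise, the events 

$$(\mathbf{x} = +1 \text{ and } \mathbf{v} < -1) \quad \text{or} \quad (\mathbf{x} = -1 \text{ and } \mathbf{v} \geq 1)$$

each has probability $0.5\alpha$. It then follows that $E\,\mathbf{z} = (1 - \alpha) - \alpha = 1 - 2\alpha \neq 1$, so that $\mathbf{x} - \mathrm{sign}(\mathbf{y})$ is correlated 
with $\mathrm{sign}(\mathbf{y})$. Hence, the estimator $\mathrm{sign}(\mathbf{y})$ does not satisfy the orthogonality condition (1.12) and, 
therefore, it cannot be the optimal least-mean-squares estimator. 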

◇
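The orthogonality test of Ex. 1.3 can also be run numerically. The sketch below is an illustration under assumed simulation settings (sample size and seed are arbitrary): it evaluates alpha = P(v >= 1) through the complementary error function, estimates the bias of sign(y), and estimates the correlation in (1.14), which the example predicts to be -2 alpha. For contrast, the same test function g(y) = sign(y) is applied to the error of the optimal estimator tanh(y), where (1.12) predicts zero.

```python
import math
import numpy as np

# alpha = P(v >= 1) for a zero-mean, unit-variance Gaussian v.
alpha = 0.5 * math.erfc(1.0 / math.sqrt(2.0))

rng = np.random.default_rng(2)
n = 400_000                                   # assumed number of realizations
x = rng.choice([-1.0, 1.0], size=n)           # BPSK symbols
y = x + rng.standard_normal(n)                # noisy observation
s = np.where(y >= 0, 1.0, -1.0)               # sign(y), with sign(0) = +1

bias = s.mean()                               # E sign(y): predicted 0
corr_sign = np.mean((x - s) * s)              # left side of (1.14): predicted -2*alpha
corr_tanh = np.mean((x - np.tanh(y)) * s)     # optimal error against g(y): predicted 0

print("alpha                     :", alpha)
print("E sign(y)                 :", bias)
print("E [x - sign(y)] sign(y)   :", corr_sign, "(theory:", -2 * alpha, ")")
print("E [x - tanh(y)] sign(y)   :", corr_tanh)
```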


1.4 GAUSSIAN RANDOM VARIABLES 


We mentioned earlier in Remark 1.1 that it is not always possible to determine a closed-form 
expression for the optimal estimator $E(\mathbf{x}|\mathbf{y})$. Only in some special cases can this 
calculation be carried out to completion (as we did in Ex. 1.2 and as we shall do in another 
example below). This difficulty will motivate us to limit ourselves in Part II (Linear 
Estimation) to the subclass of linear (or affine) estimators, namely, to choices of $h(\cdot)$ in (1.3) 
that are affine functions of the observation, say, $h(\mathbf{y}) = k\mathbf{y} + b$ for some constants $k$ 
and $b$ to be determined. Despite its apparent narrowness, this class of estimators performs 
reasonably well in many applications. 

There is an important special case for which the optimal estimator of Thm. 1.1 turns 
out to be affine in $\mathbf{y}$. This scenario happens when the random variables $\mathbf{x}$ and $\mathbf{y}$ are jointly 
Gaussian. To see this, let us introduce the matrix 


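$$R \triangleq \begin{bmatrix} \sigma_x^2 & \sigma_{xy} \\ \sigma_{xy} & \sigma_y^2 \end{bmatrix}$$

where $\{\sigma_x^2, \sigma_y^2, \sigma_{xy}\}$ denote the variances and cross-correlation of $\mathbf{x}$ and $\mathbf{y}$, respectively, 

$$\sigma_x^2 = E(\mathbf{x} - \bar{x})^2, \qquad \sigma_y^2 = E(\mathbf{y} - \bar{y})^2, \qquad \sigma_{xy} = E(\mathbf{x} - \bar{x})(\mathbf{y} - \bar{y})$$

The matrix $R$ can be regarded as the covariance matrix of the column vector $\mathrm{col}\{\mathbf{x}, \mathbf{y}\}$, 

$$R = E\left(\begin{bmatrix} \mathbf{x} \\ \mathbf{y} \end{bmatrix} - \begin{bmatrix} \bar{x} \\ \bar{y} \end{bmatrix}\right)\left(\begin{bmatrix} \mathbf{x} \\ \mathbf{y} \end{bmatrix} - \begin{bmatrix} \bar{x} \\ \bar{y} \end{bmatrix}\right)^{T}$$

where the symbol $T$ denotes vector transposition, and the notation $\mathrm{col}\{\alpha, \beta\}$ denotes a 
column vector whose entries are $\alpha$ and $\beta$. As explained in Sec. A.4, every such covariance 
matrix is necessarily symmetric, $R = R^T$. Moreover, $R$ is also nonnegative-definite, 
written as $R \geq 0$. To proceed with the analysis, we are going to assume that the covariance 
matrix $R$ is positive-definite and, hence, invertible (see Prob. I.6). 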

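Now the joint pdf of two jointly Gaussian random variables $\{\mathbf{x}, \mathbf{y}\}$ is given by (see 
Sec. A.5 for a review of Gaussian random variables and their probability density functions): 

$$f_{\mathbf{x},\mathbf{y}}(x, y) = \frac{1}{2\pi}\frac{1}{\sqrt{\det R}} \exp\left\{-\frac{1}{2}\begin{bmatrix} x - \bar{x} & y - \bar{y} \end{bmatrix} R^{-1} \begin{bmatrix} x - \bar{x} \\ y - \bar{y} \end{bmatrix}\right\}$$

Also, the individual probability density functions of $\mathbf{x}$ and $\mathbf{y}$ are given by 

$$f_{\mathbf{x}}(x) = \frac{1}{\sqrt{2\pi}}\frac{1}{\sigma_x} \exp\left\{-(x - \bar{x})^2/2\sigma_x^2\right\}$$

$$f_{\mathbf{y}}(y) = \frac{1}{\sqrt{2\pi}}\frac{1}{\sigma_y} \exp\left\{-(y - \bar{y})^2/2\sigma_y^2\right\}$$

According to Thm. 1.1, the least-mean-squares estimator of $\mathbf{x}$ given $\mathbf{y}$ is $\hat{\mathbf{x}} = E(\mathbf{x}|\mathbf{y})$, 
which requires that we determine the conditional pdf $f_{\mathbf{x}|\mathbf{y}}(x|y)$. This pdf can be obtained 
from the calculation: 

$$f_{\mathbf{x}|\mathbf{y}}(x|y) = \frac{f_{\mathbf{x},\mathbf{y}}(x, y)}{f_{\mathbf{y}}(y)} = \frac{\dfrac{1}{2\pi}\dfrac{1}{\sqrt{\det R}} \exp\left\{-\dfrac{1}{2}\begin{bmatrix} x - \bar{x} & y - \bar{y} \end{bmatrix} R^{-1} \begin{bmatrix} x - \bar{x} \\ y - \bar{y} \end{bmatrix}\right\}}{\dfrac{1}{\sqrt{2\pi}}\dfrac{1}{\sigma_y} \exp\left\{-(y - \bar{y})^2/2\sigma_y^2\right\}} \tag{1.18}$$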
In order to simplify the above ratio, we shall use the fact that $R$ can be factored into a 
product of upper-triangular, diagonal, and lower-triangular matrices, as follows (this 
can be checked by straightforward algebra): 

$$R = \begin{bmatrix} 1 & \sigma_{xy}/\sigma_y^2 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} \sigma^2 & 0 \\ 0 & \sigma_y^2 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ \sigma_{xy}/\sigma_y^2 & 1 \end{bmatrix} \tag{1.19}$$

where we introduced the scalar 

$$\sigma^2 \triangleq \sigma_x^2 - \sigma_{xy}\,\sigma_y^{-2}\,\sigma_{xy}$$

which is called the Schur complement of $\sigma_y^2$ in $R$; it is guaranteed to be positive in view of 
the assumed positive-definiteness of $R$ itself. Indeed, and more generally, let 

$$R = \begin{bmatrix} A & B \\ B^T & C \end{bmatrix}$$

be any symmetric matrix with possibly matrix-valued entries $\{A, B, C\}$ satisfying $A = A^T$ 
and $C = C^T$. Assume further that $C$ is invertible. Then it is easy to verify by direct 
calculation that every such matrix can be factored in the form 

$$\begin{bmatrix} A & B \\ B^T & C \end{bmatrix} = \begin{bmatrix} I & BC^{-1} \\ 0 & I \end{bmatrix} \begin{bmatrix} \Sigma & 0 \\ 0 & C \end{bmatrix} \begin{bmatrix} I & 0 \\ C^{-1}B^T & I \end{bmatrix}$$

where 

$$\Sigma \triangleq A - BC^{-1}B^T$$

is called the Schur complement of $R$ with respect to $C$. The factorization (1.19) is a special 
case of this result where the entries $\{A, B, C\}$ are scalars: $A = \sigma_x^2$, $B = \sigma_{xy}$, and $C = \sigma_y^2$. 
Moreover, the determinant of a positive-definite matrix is always positive (see Sec. B.1). 
We see from (1.19) that $\det R = \sigma^2\sigma_y^2$, so that $\sigma^2$ is necessarily positive since $\det R > 0$. 
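The block factorization and the determinant identity above can be verified numerically. The following is a minimal sketch under assumed dimensions (a 2 x 2 block A and a 3 x 3 block C, drawn at random); none of these choices come from the text.

```python
import numpy as np

rng = np.random.default_rng(3)
# Build an assumed 5x5 symmetric positive-definite matrix and partition it.
M = rng.standard_normal((5, 5))
R = M @ M.T + 5.0 * np.eye(5)
A, B, C = R[:2, :2], R[:2, 2:], R[2:, 2:]

C_inv = np.linalg.inv(C)
Sigma = A - B @ C_inv @ B.T                      # Schur complement of R with respect to C

# Upper-triangular, block-diagonal, and lower-triangular factors.
upper = np.block([[np.eye(2), B @ C_inv],
                  [np.zeros((3, 2)), np.eye(3)]])
middle = np.block([[Sigma, np.zeros((2, 3))],
                   [np.zeros((3, 2)), C]])
lower = np.block([[np.eye(2), np.zeros((2, 3))],
                  [C_inv @ B.T, np.eye(3)]])

print("factorization holds:", np.allclose(upper @ middle @ lower, R))
print("det R              :", np.linalg.det(R))
print("det Sigma * det C  :", np.linalg.det(Sigma) * np.linalg.det(C))
```

Because the triangular factors have unit determinant, the determinant of R equals the product of the determinants of the Schur complement and of C, which is the matrix counterpart of the scalar relation obtained from (1.19).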
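Now, by inverting both sides of (1.19), we find that the inverse of $R$ can be factored as 

$$R^{-1} = \begin{bmatrix} 1 & 0 \\ -\sigma_{xy}/\sigma_y^2 & 1 \end{bmatrix} \begin{bmatrix} 1/\sigma^2 & 0 \\ 0 & 1/\sigma_y^2 \end{bmatrix} \begin{bmatrix} 1 & -\sigma_{xy}/\sigma_y^2 \\ 0 & 1 \end{bmatrix} \tag{1.20}$$

where we used the simple fact that for any scalar $a$, 

$$\begin{bmatrix} 1 & a \\ 0 & 1 \end{bmatrix}^{-1} = \begin{bmatrix} 1 & -a \\ 0 & 1 \end{bmatrix}, \qquad \begin{bmatrix} 1 & 0 \\ a & 1 \end{bmatrix}^{-1} = \begin{bmatrix} 1 & 0 \\ -a & 1 \end{bmatrix}$$

Using (1.20), the quadratic form in the exponent of (1.18) becomes 

$$\begin{bmatrix} x - \bar{x} & y - \bar{y} \end{bmatrix} R^{-1} \begin{bmatrix} x - \bar{x} \\ y - \bar{y} \end{bmatrix} = \left[(x - \bar{x}) - \sigma_{xy}\sigma_y^{-2}(y - \bar{y})\right]^2 \sigma^{-2} + (y - \bar{y})^2 \sigma_y^{-2}$$

where the right-hand side is expressed as the sum of two quadratic terms. It follows that 

$$\exp\left\{-\frac{1}{2}\begin{bmatrix} x - \bar{x} & y - \bar{y} \end{bmatrix} R^{-1} \begin{bmatrix} x - \bar{x} \\ y - \bar{y} \end{bmatrix}\right\} = \exp\left\{-\left[(x - \bar{x}) - \sigma_{xy}\sigma_y^{-2}(y - \bar{y})\right]^2/2\sigma^2\right\} \exp\left\{-(y - \bar{y})^2/2\sigma_y^2\right\}$$

This equality, along with $\det R = \sigma^2\sigma_y^2$, allows us to simplify expression (1.18) for 
$f_{\mathbf{x}|\mathbf{y}}(x|y)$ to 

$$f_{\mathbf{x}|\mathbf{y}}(x|y) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\left[(x - \bar{x}) - \sigma_{xy}\sigma_y^{-2}(y - \bar{y})\right]^2/2\sigma^2\right\}$$

This expression has the form of the pdf of a Gaussian random variable with variance $\sigma^2$ 
and mean value $\bar{x} + \sigma_{xy}\sigma_y^{-2}(y - \bar{y})$. Consequently, the optimal estimator is given by the 
affine relation: 

$$\hat{\mathbf{x}} = E(\mathbf{x}|\mathbf{y}) = \bar{x} + \frac{\sigma_{xy}}{\sigma_y^2}(\mathbf{y} - \bar{y}) \tag{1.21}$$

with the resulting m.m.s.e. given by 

$$\text{m.m.s.e.} = \sigma_{\tilde{x}}^2 = \sigma^2 = \sigma_x^2 - \frac{\sigma_{xy}^2}{\sigma_y^2} \tag{1.22}$$

Observe that, in this Gaussian case, the m.m.s.e. is completely specified by the second-order 
statistics of the random variables $\{\mathbf{x}, \mathbf{y}\}$ (namely, $\sigma_x^2$, $\sigma_y^2$, and $\sigma_{xy}$). Note also that 
the m.m.s.e. never exceeds $\sigma_x^2$, and is strictly smaller than $\sigma_x^2$ whenever $\sigma_{xy} \neq 0$. 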
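The affine estimator (1.21) and the m.m.s.e. expression (1.22) lend themselves to a quick simulation check. The sketch below assumes particular means and a particular covariance matrix purely for illustration; it compares the sample m.s.e. of the affine estimator with (1.22), and compares an empirical conditional mean of x, taken over samples whose y falls in a narrow bin around an assumed value y0, with the value predicted by (1.21).

```python
import numpy as np

rng = np.random.default_rng(4)
mean = np.array([1.0, -2.0])                 # assumed (x_bar, y_bar)
R = np.array([[2.0, 1.2],
              [1.2, 3.0]])                   # assumed [[var_x, cov_xy], [cov_xy, var_y]]
samples = rng.multivariate_normal(mean, R, size=300_000)
x, y = samples[:, 0], samples[:, 1]

gain = R[0, 1] / R[1, 1]                     # sigma_xy / sigma_y^2
x_hat = mean[0] + gain * (y - mean[1])       # affine estimator (1.21)
mmse_theory = R[0, 0] - R[0, 1] ** 2 / R[1, 1]   # expression (1.22)
mmse_sample = np.mean((x - x_hat) ** 2)

# Empirical conditional mean of x for observations near an assumed value y0.
y0 = -1.0
mask = np.abs(y - y0) < 0.05
cond_mean_sample = x[mask].mean()
cond_mean_theory = mean[0] + gain * (y0 - mean[1])

print("m.m.s.e. (theory, sample):", mmse_theory, mmse_sample)
print("E(x | y near y0) (theory, sample):", cond_mean_theory, cond_mean_sample)
```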


Example 1.4 (Correlation coefficient) 


A measure of the correlation between two random variables is their correlation coefficient, defined 
by 

$$\rho_{xy} \triangleq \sigma_{xy}/\sigma_x\sigma_y$$

It is shown in Prob. I.6 that $\rho_{xy}$ always lies in the interval $[-1, 1]$. As $\rho_{xy}$ moves closer to zero, the 
variables $\mathbf{x}$ and $\mathbf{y}$ become more uncorrelated (in the Gaussian case, this also means that the variables 
become less dependent). We see from (1.22) that the m.m.s.e. in the Gaussian case can be rewritten 
in the form 

$$\text{m.m.s.e.} = \sigma_x^2(1 - \rho_{xy}^2)$$

This shows that when $\rho_{xy} = 0$, which occurs when $\sigma_{xy} = 0$, the resulting m.m.s.e. is $\sigma_x^2$. Also, 
from (1.21), the estimator collapses to $\hat{\mathbf{x}} = \bar{x}$. That is, we are reduced to the simple estimator 
studied in Sec. 1.1. This is expected since in the Gaussian case, a zero cross-correlation means that 
the random variables $\mathbf{x}$ and $\mathbf{y}$ are independent so there is no additional information available that we 
can use to estimate $\mathbf{x}$, besides its mean and variance. ◇


Example 1.5 (Gaussian noise) 


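Let $\mathbf{x}$ denote a Gaussian random variable with mean $\bar{x} = 1$ and variance $\sigma_x^2 = 2$. Similarly, let 
$\mathbf{v}$ denote a Gaussian random variable independent of $\mathbf{x}$, with mean $\bar{v} = 2$ and variance $\sigma_v^2$. Now 
consider the noisy measurement 

$$\mathbf{y} = 2\mathbf{x} + \mathbf{v}$$

and let us estimate $\mathbf{x}$ from $\mathbf{y}$. According to (1.21), we need to determine the quantities $\{\bar{y}, \sigma_{xy}, \sigma_y^2\}$. 
From the above equation we find that 

$$\bar{y} = 2\bar{x} + \bar{v} = 4$$

The independence of $\mathbf{x}$ and $\mathbf{v}$ implies that 

$$\sigma_y^2 = 4\sigma_x^2 + \sigma_v^2 = 8 + \sigma_v^2$$

Finally, the cross-correlation $\sigma_{xy}$ is given by 

$$\sigma_{xy} = E(\mathbf{x} - \bar{x})(\mathbf{y} - \bar{y}) = E(\mathbf{x} - 1)(2\mathbf{x} + \mathbf{v} - 4) = 4$$

where we used 

$$E\,\mathbf{x}^2 = \sigma_x^2 + \bar{x}^2 = 3$$

and 

$$E\,\mathbf{x}\mathbf{v} = E\,\mathbf{x}\, E\,\mathbf{v} = 2$$

Using (1.21) and (1.22) we obtain 

$$\hat{\mathbf{x}} = 1 + \frac{4}{8 + \sigma_v^2}(\mathbf{y} - 4), \qquad \text{m.m.s.e.} = 2 - \frac{16}{8 + \sigma_v^2}$$

Moreover, since $\sigma_{\hat{x}}^2 = \sigma_x^2 - \sigma_{\tilde{x}}^2$, we also find that 

$$\sigma_{\hat{x}}^2 = \frac{16}{8 + \sigma_v^2}$$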
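The arithmetic of Ex. 1.5 can be reproduced exactly with rational numbers. In the sketch below, the helper `gaussian_estimator` is a hypothetical name introduced here for illustration, and the noise variance is set to 1, an assumed value since the example leaves it unspecified.

```python
from fractions import Fraction

def gaussian_estimator(x_bar, y_bar, var_x, var_y, cov_xy):
    """Return (gain, offset, mmse) with x_hat = gain * y + offset, per (1.21)-(1.22)."""
    gain = cov_xy / var_y
    offset = x_bar - gain * y_bar
    mmse = var_x - cov_xy ** 2 / var_y
    return gain, offset, mmse

# Quantities of Ex. 1.5 for y = 2x + v; var_v = 1 is an assumed value.
var_v = Fraction(1)
x_bar, var_x, v_bar = Fraction(1), Fraction(2), Fraction(2)
y_bar = 2 * x_bar + v_bar            # mean of y
var_y = 4 * var_x + var_v            # variance of y (x and v independent)
cov_xy = 2 * var_x                   # E (x - x_bar)(y - y_bar)

gain, offset, mmse = gaussian_estimator(x_bar, y_bar, var_x, var_y, cov_xy)
var_x_hat = var_x - mmse             # variance of the estimator
print(gain, offset, mmse, var_x_hat)   # prints: 4/9 -7/9 2/9 16/9
```

For this assumed noise variance the estimator is x_hat = (4/9) y - 7/9, which is the same as 1 + (4/9)(y - 4), and the estimator variance 16/9 agrees with 16/(8 + 1).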
We have focused so far on scalar real-valued random variables. The results, however, 
can be extended in a straightforward manner, by using the convenience and power of the 
vector notation, to the cases of vector-valued and complex-valued random variables. 

These two situations are common in applications. For example, in channel estimation 
problems, the quantities to be estimated are the samples of the impulse response sequence 
of a supposedly finite-impulse-response (FIR) channel. If we group these samples into a 
vector $\mathbf{x}$, then we are faced with the problem of estimating a vector rather than a scalar 
quantity. Likewise, in quadrature amplitude modulation (QAM) or in quadrature 
phase-shift keying (QPSK) transmissions over a communications channel, the transmitted 
symbols are complex-valued. The recovery of these symbols at the receiver requires that we 
solve an estimation problem that involves estimating complex-valued quantities. 


2.1 OPTIMAL ESTIMATOR IN THE VECTOR CASE 


It turns out that the optimal estimator in the general vector and complex-valued case is still 
given by the conditional expectation of $\mathbf{x}$ given $\mathbf{y}$. To see this, let us start with a special 
case. Assume $\mathbf{x}$ and $\mathbf{y}$ are both real-valued with $\mathbf{x}$ a scalar and $\mathbf{y}$ a vector, say, 

$$\mathbf{y} = \mathrm{col}\{\mathbf{y}(0), \mathbf{y}(1), \ldots, \mathbf{y}(q - 1)\}$$

As before, let $\hat{\mathbf{x}} = h(\mathbf{y})$ denote an estimator for $\mathbf{x}$. Since $\mathbf{y}$ is vector-valued, the function 
$h(\cdot)$ operates on the entries of $\mathbf{y}$ and provides a real scalar quantity as a result. More 
explicitly, we write 

$$\hat{\mathbf{x}} = h\left(\mathbf{y}(0), \mathbf{y}(1), \ldots, \mathbf{y}(q - 1)\right)$$

The function $h(\cdot)$ is to be chosen optimally by minimizing the variance of the error $\tilde{\mathbf{x}} = 
\mathbf{x} - \hat{\mathbf{x}}$, i.e., by solving 

$$\min_{h(\cdot)} \; E\,\tilde{\mathbf{x}}^2$$

The same argument that we used to establish Thm. 1.1 can be repeated here to verify that 
the optimal estimator is still given by 

$$\hat{\mathbf{x}} = E(\mathbf{x}|\mathbf{y}) = E\left[\mathbf{x}\,|\,\mathbf{y}(0), \mathbf{y}(1), \ldots, \mathbf{y}(q - 1)\right] \tag{2.1}$$

The only difference between this result and that of Thm. 1.1 is that the conditional 
expectation is now computed relative to a collection of random variables $\{\mathbf{y}(i)\}$, rather than a 
single random variable. Moreover, for any function $g(\cdot)$ of $\mathbf{y}$, the orthogonality condition 
(1.5) extends to this case and is still given by 

$$E\left[\mathbf{x} - E(\mathbf{x}|\mathbf{y})\right] g(\mathbf{y}) = 0 \tag{2.2}$$
Example 2.1 (Noisy measurements of a BPSK signal) 


Let us return to Ex. 1.2, where $\mathbf{x}$ is a BPSK signal that is either $+1$ or $-1$ with probability 1/2 
each. Assume that we collect two noisy measurements $\mathbf{y}(0)$ and $\mathbf{y}(1)$ of $\mathbf{x}$, say, $\mathbf{y}(0) = \mathbf{x} + \mathbf{v}(0)$ 
and $\mathbf{y}(1) = \mathbf{x} + \mathbf{v}(1)$, where $\{\mathbf{v}(0), \mathbf{v}(1)\}$ are zero-mean unit-variance Gaussian random variables 
that are independent of each other and of $\mathbf{x}$. The value of $\mathbf{x}$ is the same in both measurements (i.e., if 
it is $+1$ in the measurement $\mathbf{y}(0)$, it is also $+1$ in the measurement $\mathbf{y}(1)$, and similarly for $-1$). We 
may interpret $\{\mathbf{y}(0), \mathbf{y}(1)\}$ as the noisy signals measured at two antennas as a result of transmitting 
$\mathbf{x}$ over two additive Gaussian-noise channels (see Fig. 2.1). 


FIGURE 2.1 Reception by two antennas of a symbol $\mathbf{x}$ transmitted over two additive 
Gaussian-noise channels, with disturbances $\mathbf{v}(0)$ and $\mathbf{v}(1)$ and measurements $\mathbf{y}(0)$ and $\mathbf{y}(1)$. 


We can then pose the problem of estimating $\mathbf{x}$ given both measurements $\{\mathbf{y}(0), \mathbf{y}(1)\}$. According 
to (2.1), the solution is given by 

$$\hat{\mathbf{x}} = E\left[\mathbf{x}\,|\,\mathbf{y}(0), \mathbf{y}(1)\right]$$

The evaluation of the conditional expectation in this case is a trivial extension of the derivation given 
in Ex. 1.2, and it is left as an exercise to the reader (see Prob. I.13, where the more general case of 
multiple measurements is treated). The result of that problem shows that 

$$\hat{\mathbf{x}} = \tanh[\mathbf{y}(0) + \mathbf{y}(1)]$$


In the context of the two-antenna example of Fig. 2.1, this result leads to the optimal receiver structure 
shown in Fig. 2.2. 

FIGURE 2.2 Optimal receiver structure for recovering a symbol $\mathbf{x}$ from two separate 
measurements over additive Gaussian-noise channels. 
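The benefit of the second antenna in Ex. 2.1 can be quantified by simulation. The sketch below uses an assumed sample size and seed; it compares the m.s.e. of the single-measurement estimator tanh[y(0)] with that of the two-measurement estimator tanh[y(0) + y(1)], and it checks the orthogonality condition (2.2) for one particular test function, g(y) = y(0), chosen here only for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 400_000                               # assumed number of realizations
x = rng.choice([-1.0, 1.0], size=n)       # same BPSK symbol seen by both antennas
y0 = x + rng.standard_normal(n)           # measurement at antenna 0
y1 = x + rng.standard_normal(n)           # measurement at antenna 1

x_hat_one = np.tanh(y0)                   # optimal estimator given y(0) alone
x_hat_two = np.tanh(y0 + y1)              # optimal estimator given both measurements

mse_one = np.mean((x - x_hat_one) ** 2)
mse_two = np.mean((x - x_hat_two) ** 2)
ortho = np.mean((x - x_hat_two) * y0)     # sample version of (2.2) with g(y) = y(0)

print("m.s.e. with one measurement :", mse_one)
print("m.s.e. with two measurements:", mse_two)
print("E [x - x_hat] y(0)          :", ortho)
```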
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Let us now study the general case and determine the form of the optimal estimator for a 
vector-valued random variable a given another vector-valued random variable y, with both 
variables allowed to be complex-valued as well. Thus, assume that x is p—dimensional 
while y is g—dimensional. 

Again, let & = h(y) denote an estimator for x. Since a and y are vector-valued, the 
function A(-) operates on the entries of y and provides a vector quantity as a result. More 
explicitly, we can write for the individual entries of ĉ and y, 


$(0) ho[y(0), (1). ..., (a — 1)] 

ĉ(1) hily(0), y(1),--., y(q- 1)] 

&(2) = h5|y(0), y(1),---,y¥(a—- 1)] 
&(p — 1) hs [y (0), (D... (a — I 


where the {hj,(-)} represent the individual mappings from the observation vector y to the 
estimators (2(k)). We can then seek optimal functions (^, (:)) that minimize the variance 
of the error in each component of x, namely, each ^4 (-) is determined by solving 


hat) E |&(k)|? (2.3) 


where X 
&(k) = x(k) — hely) 
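Actually, this formulation is equivalent to solving over all $\{h_k(\cdot)\}$ the following problem:

$$\min_{\{h_k(\cdot)\}} \; E\,\tilde{x}^*\tilde{x} \tag{2.4}$$

This is because the quantity $E\,\tilde{x}^*\tilde{x}$ in (2.4) is the sum of the individual terms $E\,|\tilde{x}(k)|^2$,

$$E\,\tilde{x}^*\tilde{x} = E\,|\tilde{x}(0)|^2 + E\,|\tilde{x}(1)|^2 + \ldots + E\,|\tilde{x}(p-1)|^2$$

with each term $E\,|\tilde{x}(k)|^2$ depending only on the corresponding function $h_k(\cdot)$. In this way,
minimizing the sum $E\,\tilde{x}^*\tilde{x}$ over all $\{h_k(\cdot)\}$ is equivalent to minimizing each individual
term, $E\,|\tilde{x}(k)|^2$, over its $h_k(\cdot)$. Note further that

$$E\,\tilde{x}^*\tilde{x} = \mathrm{Tr}(E\,\tilde{x}\tilde{x}^*) = \mathrm{Tr}(R_{\tilde{x}})$$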


That is, the scalar quantity $E\,\tilde{x}^*\tilde{x}$ in (2.4) is equal to the trace of the error covariance matrix
$R_{\tilde{x}}$. This is because the trace of a matrix is equal to the sum of its diagonal elements and,
therefore, for any column vector a, it holds that $a^*a = \mathrm{Tr}(aa^*)$. Then problem (2.4) is
also equivalent to solving over all $\{h_k(\cdot)\}$:

$$\min_{\{h_k(\cdot)\}} \; \mathrm{Tr}(R_{\tilde{x}}) \tag{2.5}$$
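
The identity a*a = Tr(aa*) used above is easy to confirm on a small example. The following is a minimal sketch with an arbitrarily chosen complex vector; the helper names `inner`, `outer`, and `trace` are hypothetical and serve only this illustration.

```python
def inner(a):
    """Inner product a^* a of a complex column vector stored as a list."""
    return sum(ai.conjugate() * ai for ai in a)


def outer(a):
    """Outer product a a^* as a list of rows."""
    return [[ai * aj.conjugate() for aj in a] for ai in a]


def trace(m):
    """Sum of the diagonal entries of a square matrix."""
    return sum(m[i][i] for i in range(len(m)))


a = [1 + 2j, -0.5j, complex(3.0)]
lhs = inner(a)            # a^* a
rhs = trace(outer(a))     # Tr(a a^*)
print(lhs, rhs)
```

Both quantities reduce to the sum of the squared magnitudes of the entries of a, which is why the scalar cost in (2.4) and the trace cost in (2.5) coincide.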


Now the solution to the general problem (2.3) follows from the special case discussed
at the beginning of this section. Indeed, if we express x(k) and $h_k(\cdot)$ in terms of their real
and imaginary parts, say,

$$x(k) \triangleq x_r(k) + j\,x_i(k), \qquad h_k(y) \triangleq h_{r,k}(y) + j\,h_{i,k}(y)$$

then we can expand the error criterion as

$$E\,|x(k) - h_k(y)|^2 = E\,[x_r(k) - h_{r,k}(y)]^2 + E\,[x_i(k) - h_{i,k}(y)]^2$$


and we are reduced to minimizing the sum of two nonnegative quantities over the
unknowns $\{h_{r,k}(\cdot), h_{i,k}(\cdot)\}$. This is equivalent to minimizing each term separately,

$$\min_{h_{r,k}(\cdot)} E\,[x_r(k) - h_{r,k}(y)]^2, \qquad \min_{h_{i,k}(\cdot)} E\,[x_i(k) - h_{i,k}(y)]^2$$

and the solution we already know from (2.1) to be given by

$$
\begin{aligned}
\hat{x}_r(k) &= E\,[\,x_r(k) \mid y_r(0), y_i(0), y_r(1), y_i(1), \ldots, y_r(q-1), y_i(q-1)\,] \\
\hat{x}_i(k) &= E\,[\,x_i(k) \mid y_r(0), y_i(0), y_r(1), y_i(1), \ldots, y_r(q-1), y_i(q-1)\,]
\end{aligned}
$$

Therefore, the optimal choice for $h_k(\cdot)$ is

$$\hat{x}(k) = E\,[\,x(k) \mid y\,]$$


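so that the optimal estimator that minimizes the variances of the individual errors $\{\tilde{x}(k)\}$
is

$$
\hat{x} = E(x \mid y) \triangleq
\begin{bmatrix}
E\,[\,x(0) \mid y\,] \\
E\,[\,x(1) \mid y\,] \\
\vdots \\
E\,[\,x(p-1) \mid y\,]
\end{bmatrix}
\tag{2.6}
$$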


Likewise, using the property (1.5) of conditional expectations, we conclude that the
orthogonality condition in this case is still given by

$$E\,[\,x - E(x \mid y)\,]\,g(y) = 0 \tag{2.7}$$

for any function g(·) of the observation vector y.


Theorem 2.1 (Optimal estimation in the vector case) The least-mean-squares
estimator of a (possibly complex-valued) vector x given another (possibly
complex-valued) vector y is the conditional expectation of x given y, i.e.,
$\hat{x} = E(x \mid y)$. This estimator solves

$$\min_{h(\cdot)} \; \mathrm{Tr}(R_{\tilde{x}})$$

where $R_{\tilde{x}} = E\,\tilde{x}\tilde{x}^*$ and $\tilde{x} = x - \hat{x}$.


Example 2.2 (Estimation of transmitted symbols) 


Consider again the setting of Ex. A.4 and assume N = 2, so that we are interested in estimating the
vector x = col{s(0), s(1)} from the observation y = col{y(0), y(1)}, where

$$y(0) = s(0) + v(0) \quad \text{and} \quad y(1) = s(1) + 0.5\,s(0) + v(1)$$

Here we are assuming that transmissions start at time 0 so that s(-1) = 0. If we introduce the 2 x 2
matrix

$$H = \begin{bmatrix} 1 & 0 \\ 0.5 & 1 \end{bmatrix}$$

and the vector v = col{v(0), v(1)}, then the above equations can be written more compactly in
matrix form as

$$y = Hx + v$$
We are therefore faced with the problem of estimating x from y, with the noise term v assumed
independent of x. Now recall that the symbols s(i) are either +1 or -1 with probabilities p or 1 - p,
respectively. Hence, the vector x can assume one of four values:

$$
\left\{
\begin{bmatrix} +1 \\ +1 \end{bmatrix},\;
\begin{bmatrix} -1 \\ -1 \end{bmatrix},\;
\begin{bmatrix} +1 \\ -1 \end{bmatrix},\;
\begin{bmatrix} -1 \\ +1 \end{bmatrix}
\right\}
\;\triangleq\; \{m_0, m_1, m_2, m_3\}
$$

with probabilities $\{p^2, (1-p)^2, p(1-p), p(1-p)\}$, respectively. Observe that we are denoting
the four possibilities for x by $\{m_i,\; i = 0, \ldots, 3\}$ for compactness of notation. Let also q = 1 - p.
Moreover, the pdf of v is Gaussian and given by

$$f_v(v) = \frac{1}{2\pi} \exp\left\{ -\frac{1}{2}\, v^T v \right\}$$

since the covariance matrix of v is assumed to be the identity matrix. It then follows that the pdf of
y is given by

$$f_y(y) = p^2 f_v(y - Hm_0) + q^2 f_v(y - Hm_1) + pq\, f_v(y - Hm_2) + pq\, f_v(y - Hm_3)$$

Similarly, we obtain, as in Ex. 1.2, that

$$
\begin{aligned}
f_{x,y}(x, y) &= f_x(x) \cdot f_v(y - Hx) \\
&= p^2 f_v(y - Hm_0)\,\delta(x - m_0) + q^2 f_v(y - Hm_1)\,\delta(x - m_1) \\
&\quad + pq\, f_v(y - Hm_2)\,\delta(x - m_2) + pq\, f_v(y - Hm_3)\,\delta(x - m_3)
\end{aligned}
$$


The expressions so derived for $f_y(y)$ and $f_{x,y}(x, y)$ allow us to evaluate $f_{x|y}(x|y)$, from which we
can evaluate the desired conditional expectation $E(x \mid y)$ and, consequently, $\{\hat{s}(0), \hat{s}(1)\}$. This final
computation is left as an exercise to the reader; see Prob. I.15.

◊
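
For readers who wish to experiment with this example numerically, the conditional mean can be evaluated directly as a probability-weighted average of the four candidate vectors. The sketch below is an illustration only: it assumes the same H, unit-variance independent noise, and the ordering of the four candidates implied by the probabilities listed above. The names `noise_pdf` and `estimate_symbols` are hypothetical.

```python
import math

H = ((1.0, 0.0), (0.5, 1.0))
# Candidate vectors m0..m3, in the order matching the priors p^2, q^2, pq, pq.
CANDIDATES = ((1, 1), (-1, -1), (1, -1), (-1, 1))


def noise_pdf(v0, v1):
    """Pdf of a real 2 x 1 Gaussian vector with zero mean and identity covariance."""
    return math.exp(-0.5 * (v0 * v0 + v1 * v1)) / (2.0 * math.pi)


def estimate_symbols(y0, y1, p):
    """Return (s_hat(0), s_hat(1)) = E(x | y) for the two-symbol model y = Hx + v."""
    q = 1.0 - p
    priors = (p * p, q * q, p * q, p * q)
    weights = []
    for (s0, s1), prior in zip(CANDIDATES, priors):
        hm0 = H[0][0] * s0 + H[0][1] * s1
        hm1 = H[1][0] * s0 + H[1][1] * s1
        weights.append(prior * noise_pdf(y0 - hm0, y1 - hm1))
    f_y = sum(weights)  # pdf of y at the observed value (may underflow for very large |y|)
    s0_hat = sum(w * m[0] for w, m in zip(weights, CANDIDATES)) / f_y
    s1_hat = sum(w * m[1] for w, m in zip(weights, CANDIDATES)) / f_y
    return s0_hat, s1_hat


print(estimate_symbols(1.0, 1.5, 0.5))
```

With p = 1/2 the estimate is an odd function of the observation, so an observation at the origin yields a zero estimate for both symbols.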


2.2 SPHERICALLY INVARIANT GAUSSIAN VARIABLES 


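We saw earlier in Sec. 1.4 that for scalar real-valued Gaussian random variables {x, y},
the optimal estimator of x given y depends in an affine manner on the observation y. The
same conclusion holds in the general vector complex-valued case.

So assume that x and y are jointly Gaussian random vector variables with a nonsingular
covariance matrix

$$
R \triangleq E \begin{bmatrix} x - \bar{x} \\ y - \bar{y} \end{bmatrix}
\begin{bmatrix} x - \bar{x} \\ y - \bar{y} \end{bmatrix}^*
= \begin{bmatrix} R_x & R_{xy} \\ R_{yx} & R_y \end{bmatrix}
$$

where

$$
\begin{aligned}
R_x &= E\,(x - \bar{x})(x - \bar{x})^* \\
R_y &= E\,(y - \bar{y})(y - \bar{y})^* \\
R_{xy} &= E\,(x - \bar{x})(y - \bar{y})^* = R_{yx}^*
\end{aligned}
$$

The variables {x, y} are assumed to be complex-valued with dimensions p x 1 for x and
q x 1 for y. If the variables were real-valued, their individual and joint pdfs would be given by

$$
\begin{aligned}
f_x(x) &= \frac{1}{\sqrt{(2\pi)^p}} \frac{1}{\sqrt{\det R_x}} \exp\left\{ -\frac{1}{2} (x - \bar{x})^T R_x^{-1} (x - \bar{x}) \right\} \\
f_y(y) &= \frac{1}{\sqrt{(2\pi)^q}} \frac{1}{\sqrt{\det R_y}} \exp\left\{ -\frac{1}{2} (y - \bar{y})^T R_y^{-1} (y - \bar{y}) \right\} \\
f_{x,y}(x, y) &= \frac{1}{\sqrt{(2\pi)^{p+q}}} \frac{1}{\sqrt{\det R}} \exp\left\{ -\frac{1}{2} \begin{bmatrix} (x - \bar{x})^T & (y - \bar{y})^T \end{bmatrix} R^{-1} \begin{bmatrix} x - \bar{x} \\ y - \bar{y} \end{bmatrix} \right\}
\end{aligned}
$$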


In particular, observe that if x and y were uncorrelated, i.e., if $R_{xy} = 0$, then the covariance
matrix R becomes block diagonal, with entries $\{R_x, R_y\}$, and it is straightforward to
verify from the above pdf expressions that in this case $f_{x,y}(x, y) = f_x(x) f_y(y)$. In other
words, uncorrelated real-valued Gaussian random variables are also independent.

When, on the other hand, x and y are complex-valued, they need to satisfy two conditions
in order for their individual and joint pdfs to have forms similar to the above in the
Gaussian case. These conditions are known as circularity assumptions, and the need for
them is explained in Sec. A.5. The conditions are as follows. Each variable is required to
be circular, meaning that {x, y} should satisfy

$$E\,(y - \bar{y})(y - \bar{y})^T = 0 \quad \text{and} \quad E\,(x - \bar{x})(x - \bar{x})^T = 0$$

with the transposition symbol T used instead of the conjugation symbol *. The variables
are also required to be second-order circular, i.e.,

$$E\,(x - \bar{x})(y - \bar{y})^T = 0$$

These circularity assumptions are not needed when the variables {x, y} are real-valued.
The circularity of x in the complex case guarantees that its pdf in the Gaussian case will
have the form

$$f_x(x) = \frac{1}{\pi^p} \frac{1}{\det R_x} \exp\left\{ -(x - \bar{x})^* R_x^{-1} (x - \bar{x}) \right\}$$

Likewise, the circularity of y guarantees that its pdf will have the form

$$f_y(y) = \frac{1}{\pi^q} \frac{1}{\det R_y} \exp\left\{ -(y - \bar{y})^* R_y^{-1} (y - \bar{y}) \right\}$$

The second-order circularity of x and y guarantees that the joint pdf of {x, y} will have
the form

$$f_{x,y}(x, y) = \frac{1}{\pi^{p+q}} \frac{1}{\det R} \exp\left\{ -\begin{bmatrix} (x - \bar{x})^* & (y - \bar{y})^* \end{bmatrix} R^{-1} \begin{bmatrix} x - \bar{x} \\ y - \bar{y} \end{bmatrix} \right\}$$

Thus, observe again that if x and y were uncorrelated, then the above pdf expressions lead
to

$$f_{x,y}(x, y) = f_x(x) \cdot f_y(y)$$

which shows that uncorrelated circular Gaussian random variables are also independent.
This conclusion would not have held without the circularity assumptions in the complex
case. We may add that circular Gaussian random variables are also called
spherically-invariant Gaussian random variables.
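Now the least-mean-squares estimator of x given y requires that we determine the
conditional pdf $f_{x|y}(x|y)$. This can be obtained from the calculation

$$
f_{x|y}(x|y) = \frac{f_{x,y}(x, y)}{f_y(y)} = \frac{1}{\pi^p} \frac{\det R_y}{\det R} \cdot
\frac{\exp\left\{ -\begin{bmatrix} (x - \bar{x})^* & (y - \bar{y})^* \end{bmatrix} R^{-1} \begin{bmatrix} x - \bar{x} \\ y - \bar{y} \end{bmatrix} \right\}}
{\exp\left\{ -(y - \bar{y})^* R_y^{-1} (y - \bar{y}) \right\}}
$$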


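Following the same argument that we used earlier in Sec. 1.4, we can simplify the above
expression by introducing the block upper-diagonal-lower triangular factorization (whose
validity can again be verified, e.g., by direct calculation):

$$
R = \begin{bmatrix} R_x & R_{xy} \\ R_{yx} & R_y \end{bmatrix}
= \begin{bmatrix} I & R_{xy} R_y^{-1} \\ 0 & I \end{bmatrix}
\begin{bmatrix} \Sigma & 0 \\ 0 & R_y \end{bmatrix}
\begin{bmatrix} I & 0 \\ R_y^{-1} R_{yx} & I \end{bmatrix}
$$

where $\Sigma$ is the Schur complement of $R_y$ in R, namely,

$$\Sigma = R_x - R_{xy} R_y^{-1} R_{yx}$$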
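
As a quick check of the factorization (a verification sketch; the intermediate products written out here are not part of the original derivation), multiplying the three factors from right to left gives:

```latex
\begin{aligned}
\begin{bmatrix} \Sigma & 0 \\ 0 & R_y \end{bmatrix}
\begin{bmatrix} I & 0 \\ R_y^{-1} R_{yx} & I \end{bmatrix}
&= \begin{bmatrix} \Sigma & 0 \\ R_{yx} & R_y \end{bmatrix} \\[4pt]
\begin{bmatrix} I & R_{xy} R_y^{-1} \\ 0 & I \end{bmatrix}
\begin{bmatrix} \Sigma & 0 \\ R_{yx} & R_y \end{bmatrix}
&= \begin{bmatrix} \Sigma + R_{xy} R_y^{-1} R_{yx} & R_{xy} \\ R_{yx} & R_y \end{bmatrix}
 = \begin{bmatrix} R_x & R_{xy} \\ R_{yx} & R_y \end{bmatrix}
\end{aligned}
```

The last equality uses the definition of the Schur complement. Since both triangular factors have unit determinant, the same factorization also shows that the determinant of R is the product of the determinants of the Schur complement and of R_y.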


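Inverting both sides of the above factorization for R we get

$$
R^{-1} = \begin{bmatrix} I & 0 \\ -R_y^{-1} R_{yx} & I \end{bmatrix}
\begin{bmatrix} \Sigma^{-1} & 0 \\ 0 & R_y^{-1} \end{bmatrix}
\begin{bmatrix} I & -R_{xy} R_y^{-1} \\ 0 & I \end{bmatrix}
$$

which allows us to express the term

$$\begin{bmatrix} (x - \bar{x})^* & (y - \bar{y})^* \end{bmatrix} R^{-1} \begin{bmatrix} x - \bar{x} \\ y - \bar{y} \end{bmatrix}$$

which appears in the expression for $f_{x,y}(x, y)$, as the following separable sum of two
quadratic terms,

$$
\left[ (x - \bar{x}) - R_{xy} R_y^{-1} (y - \bar{y}) \right]^* \Sigma^{-1} \left[ (x - \bar{x}) - R_{xy} R_y^{-1} (y - \bar{y}) \right]
+ (y - \bar{y})^* R_y^{-1} (y - \bar{y})
$$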
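Substituting this equality into the expression for $f_{x|y}(x|y)$, and using

$$\det R = \det \Sigma \cdot \det R_y$$

we conclude that

$$
f_{x|y}(x|y) = \frac{1}{\pi^p} \frac{1}{\det \Sigma}
\exp\left\{ -\left[ (x - \bar{x}) - R_{xy} R_y^{-1} (y - \bar{y}) \right]^* \Sigma^{-1} \left[ (x - \bar{x}) - R_{xy} R_y^{-1} (y - \bar{y}) \right] \right\}
$$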


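which can be interpreted as the pdf of a circular Gaussian random variable with covariance
matrix $\Sigma$ and mean value $\bar{x} + R_{xy} R_y^{-1} (y - \bar{y})$. We therefore conclude that

$$\hat{x} = E(x \mid y) = \bar{x} + R_{xy} R_y^{-1} (y - \bar{y})$$

and the resulting m.m.s.e. matrix is

$$R_{\tilde{x}} = R_x - R_{\hat{x}} = R_x - R_{xy} R_y^{-1} R_{yx}$$

where $R_{\hat{x}}$ denotes the covariance matrix of $\hat{x}$. These are the extensions to the vector case
of expressions (1.21) and (1.22) in the scalar case. Note further that in the zero-mean case we obtain

$$\hat{x} = R_{xy} R_y^{-1}\, y$$

with $\{R_x, R_y, R_{xy}\}$ defined accordingly,

$$R_x = E\,xx^*, \qquad R_y = E\,yy^*, \qquad R_{xy} = E\,xy^*$$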


Observe from the above expressions that the solution of the optimal estimation problem
in the Gaussian case is completely determined by the second-order moments of the
variables {x, y} (i.e., by $R_x$, $R_y$, and $R_{xy}$). This means that, in the Gaussian case, the
m.m.s.e. matrix can be evaluated beforehand by the designer (i.e., prior to the collection of
the observations); a step that provides a mechanism for checking whether the
least-mean-squares estimator will be an acceptable solution.


Lemma 2.1 (Circular Gaussian variables) If x and y are two circular and
jointly Gaussian random variables with means $\{\bar{x}, \bar{y}\}$ and covariance matrices
$\{R_x, R_y, R_{xy}\}$, then the least-mean-squares estimator of x given y is

$$\hat{x} = \bar{x} + R_{xy} R_y^{-1} (y - \bar{y})$$

and the resulting minimum cost is m.m.s.e. $= R_x - R_{xy} R_y^{-1} R_{yx}$.
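
The formulas of the lemma are simple to evaluate once the second-order moments are known. The sketch below is a minimal illustration restricted to a real-valued scalar x and a real-valued 2 x 1 observation y, in which case conjugate transposes reduce to ordinary transposes. The function names and the numerical moments are assumptions chosen for the example: they correspond to y(i) = x + v(i) with unit-variance x and independent unit-variance noises.

```python
def solve_2x2(a, b):
    """Solve a z = b for a 2 x 2 matrix a and a length-2 vector b (Cramer's rule)."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    if det == 0:
        raise ValueError("R_y must be nonsingular")
    return ((b[0] * a[1][1] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - a[1][0] * b[0]) / det)


def gaussian_estimator(x_bar, y_bar, r_x, r_xy, r_y):
    """Return (estimate function, m.m.s.e.) for scalar x and 2 x 1 real-valued y."""
    # Gain k = R_xy R_y^{-1}; because R_y is symmetric, k solves R_y k^T = R_xy^T.
    gain = solve_2x2(r_y, r_xy)
    mmse = r_x - (gain[0] * r_xy[0] + gain[1] * r_xy[1])

    def estimate(y):
        return x_bar + gain[0] * (y[0] - y_bar[0]) + gain[1] * (y[1] - y_bar[1])

    return estimate, mmse


# Moments for y(0) = x + v(0), y(1) = x + v(1) with unit variances throughout.
estimate, mmse = gaussian_estimator(
    x_bar=0.0, y_bar=(0.0, 0.0), r_x=1.0,
    r_xy=(1.0, 1.0), r_y=((2.0, 1.0), (1.0, 2.0)))
print(estimate((3.0, 0.0)), mmse)
```

For these moments the gain is (1/3, 1/3) and the m.m.s.e. is 1/3, and both are available before any observation is collected.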


2.3 EQUIVALENT OPTIMIZATION CRITERION 


A useful fact to highlight at the end of this chapter is that the optimal estimator $E(x \mid y)$
defined by (2.6), which solves problems (2.3)-(2.5), is also the optimal solution of another
related matrix-valued error criterion (cf. (2.9) further ahead), as we now explain.

Thus, consider the following alternative formulation. Assume that we pose the problem
of estimating x from y by requiring that the functions $\{h_k(\cdot)\}$ be such that they minimize
the variance of any arbitrary linear combination of the entries of the error vector, say,
$a^*(x - h(y))$ for any a. That is, assume we replace the optimization problem (2.3) by the
alternative problem

$$\min_{\{h_k(\cdot)\}} \; E\,|a^*\tilde{x}|^2, \quad \text{for any column vector } a \tag{2.8}$$


The error vector $\tilde{x}$ is dependent on the choice of h and, therefore, the covariance matrix
$E\,\tilde{x}\tilde{x}^*$ is also dependent on h. Let us indicate this fact explicitly by writing

$$R_{\tilde{x}}(h) \triangleq E\,\tilde{x}\tilde{x}^*$$

Now note that

$$E\,|a^*\tilde{x}|^2 = a^* R_{\tilde{x}}(h)\, a$$

so that problem (2.8) is in effect seeking an optimal function $h^o$ such that, for any vector a
and for any other h,

$$a^* R_{\tilde{x}}(h)\, a \;\ge\; a^* R_{\tilde{x}}(h^o)\, a$$
That is, the difference matrix $R_{\tilde{x}}(h) - R_{\tilde{x}}(h^o)$ should be nonnegative-definite for all h.
For this reason, we can equivalently interpret (2.8) as the problem of minimizing the error
covariance matrix $R_{\tilde{x}}$ itself, written as

$$\min_{h(\cdot)} \; E\,\tilde{x}\tilde{x}^* \tag{2.9}$$

Comparing with (2.4) we see that we are replacing the scalar $E\,\tilde{x}^*\tilde{x}$ by the matrix $E\,\tilde{x}\tilde{x}^*$.
Let us now verify that the solution to (2.8), or equivalently (2.9), is again $h^o(y) = E(x \mid y)$.
For this purpose, we recall that, for any h(y),

$$\tilde{x} = x - \hat{x} = x - h(y)$$

so that the covariance matrix $R_{\tilde{x}}(h)$ is given by

$$
\begin{aligned}
R_{\tilde{x}}(h) &= E\,[x - h(y)][x - h(y)]^* \\
&= E\,xx^* - E\,x h^*(y) - E\,h(y) x^* + E\,h(y) h^*(y)
\end{aligned}
$$


We now verify that

$$R_{\tilde{x}}(h) - R_{\tilde{x}}(h^o) \ge 0$$

for any choice of h. Indeed, from the orthogonality property (2.7), we have that $x - h^o(y)$
is uncorrelated with any function of y. Hence,

$$
\begin{aligned}
R_{\tilde{x}}(h^o) &= E\,[x - h^o(y)][x - h^o(y)]^* \\
&= E\,[x - h^o(y)]\,x^* \\
&= E\,xx^* - E\,h^o(y) x^*
\end{aligned}
$$


Subtracting from $R_{\tilde{x}}(h)$ leads to

$$R_{\tilde{x}}(h) - R_{\tilde{x}}(h^o) = -E\,x h^*(y) - E\,h(y) x^* + E\,h(y) h^*(y) + E\,h^o(y) x^*$$

From the orthogonality property (2.7) we again have that

$$E\,[x - h^o(y)]\,h^{o*}(y) = 0, \qquad E\,[x - h^o(y)]\,h^*(y) = 0$$

so that

$$E\,x h^{o*}(y) = E\,h^o(y) h^{o*}(y) \quad \text{and} \quad E\,x h^*(y) = E\,h^o(y) h^*(y)$$

These two equalities allow us to rewrite the difference $R_{\tilde{x}}(h) - R_{\tilde{x}}(h^o)$ as a perfect square:

$$R_{\tilde{x}}(h) - R_{\tilde{x}}(h^o) = E\,[h^o(y) - h(y)][h^o(y) - h(y)]^*$$


The right-hand side is nonnegative-definite for all h, as desired. Finally, since the cost used 
in (2.4) is simply the trace of the error covariance matrix, we conclude that minimizing the 
error covariance matrix is equivalent to minimizing its trace. 


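Lemma 2.2 (Cost function) The conditional expectation of x given y is
optimal relative to either cost

$$\min_{h(\cdot)} \; \mathrm{Tr}(R_{\tilde{x}}) \quad \text{or} \quad \min_{h(\cdot)} \; R_{\tilde{x}}$$

where $R_{\tilde{x}} = E\,\tilde{x}\tilde{x}^*$ and $\tilde{x} = x - \hat{x}$.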
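
The matrix-valued optimality in the lemma can be checked exactly on a small discrete example. The sketch below is an illustration under stated assumptions: a made-up joint distribution over a real-valued 2 x 1 vector x and a binary observation y, with a competing estimator `h_other` chosen arbitrarily. All names and numbers are hypothetical. The code computes the conditional mean, forms both error covariance matrices, and inspects their difference.

```python
# Each entry: (probability, x as a 2-vector, observed y). Values are illustrative.
JOINT = [
    (0.2, (1.0, 0.0), 0),
    (0.3, (-1.0, 2.0), 0),
    (0.1, (0.0, 1.0), 1),
    (0.4, (2.0, -1.0), 1),
]


def conditional_mean(joint):
    """Return {y: E(x | y)} for a finite joint distribution."""
    mass, acc = {}, {}
    for p, x, y in joint:
        mass[y] = mass.get(y, 0.0) + p
        prev = acc.get(y, (0.0, 0.0))
        acc[y] = (prev[0] + p * x[0], prev[1] + p * x[1])
    return {y: (acc[y][0] / mass[y], acc[y][1] / mass[y]) for y in mass}


def error_covariance(joint, h):
    """Return the 2 x 2 matrix E (x - h(y))(x - h(y))^T."""
    r = [[0.0, 0.0], [0.0, 0.0]]
    for p, x, y in joint:
        e = (x[0] - h[y][0], x[1] - h[y][1])
        for i in range(2):
            for j in range(2):
                r[i][j] += p * e[i] * e[j]
    return r


h_opt = conditional_mean(JOINT)
h_other = {0: (0.0, 1.0), 1: (1.0, -1.0)}     # arbitrary competing estimator
r_opt = error_covariance(JOINT, h_opt)
r_other = error_covariance(JOINT, h_other)
diff = [[r_other[i][j] - r_opt[i][j] for j in range(2)] for i in range(2)]
trace_diff = diff[0][0] + diff[1][1]
det_diff = diff[0][0] * diff[1][1] - diff[0][1] * diff[1][0]
print(diff, trace_diff, det_diff)
```

A symmetric 2 x 2 matrix with nonnegative trace and nonnegative determinant is nonnegative-definite, so the printed values confirm that the competing estimator has a larger error covariance matrix and, in particular, a larger trace.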


Summary and Notes 


In this initial part we highlighted several concepts and results in least-mean-squares estimation
theory. Some of these concepts are reproduced below in a less technical language in order to reinforce
their importance.


SUMMARY OF MAIN RESULTS 


1. The variance of a random variable serves as a measure of the amount of uncertainty about the 
variable: the larger the variance the less certain we are about the value it may assume in an 
experiment. 


2. The least-mean-squares error criterion is useful in that it leads to tractable mathematical solu- 
tions. The criterion is also intuitively appealing. By seeking to minimize the variance of the 
estimation error we are in effect attempting to force this error to assume values close to its 
mean and, hence, to assume small values since the mean is zero. 


3. The least-mean-squares estimator of a random variable x given another random variable y is
the conditional expectation estimator, namely, $\hat{x} = E(x \mid y)$. This estimator is optimal in the
sense that it minimizes the covariance matrix of the error vector (or, equivalently, its trace),
i.e., it solves

$$\min_{h(\cdot)} \; E\,\tilde{x}\tilde{x}^* \quad \text{or} \quad \min_{h(\cdot)} \; \mathrm{Tr}(R_{\tilde{x}})$$


4. A defining property of the least-mean-squares estimator is that the resulting estimation error
is uncorrelated with any function of the observations, namely,

$$E\,(x - \hat{x})\,g(y) = 0 \quad \text{for any function } g(\cdot) \text{ of } y$$

In particular, $\tilde{x} \perp \hat{x}$ and $\tilde{x} \perp y$.


5. The evaluation of the conditional expectation, $E(x \mid y)$, is a formidable task in most cases.
However, for circular Gaussian random variables the estimator $\hat{x}$ is related to the observation
y in an affine manner, namely,

$$\hat{x} = \bar{x} + R_{xy} R_y^{-1} (y - \bar{y})$$

where

$$R_x = E\,(x - \bar{x})(x - \bar{x})^*, \quad R_y = E\,(y - \bar{y})(y - \bar{y})^*, \quad \text{and} \quad R_{xy} = E\,(x - \bar{x})(y - \bar{y})^*$$

In particular, the estimator is completely determined from knowledge of the first and second-
order moments of {x, y}, namely, their means, covariances and cross-covariance.
BIBLIOGRAPHIC NOTES


Probability theory. The exposition in these two initial chapters assumes some basic knowledge 
of probability theory; mainly with regards to the concepts of mean, variance, probability density 
function, and vector-random variables. Most of these ideas were defined and introduced from first 
principles. If additional help is needed, some accessible references on probability theory and basic 
random variable concepts are Papoulis (1991), Picinbono (1993), Leon-Garcia (1994), Stark and 
Woods (1994), and Durrett (1996). The textbook by Leon-Garcia (1994) is rich in examples, and is 
particularly suited to an engineering audience. 


Mean-square-error performance. The squared-error criterion, whereby the square of the es- 
timation error is used as a measure of performance, has a distinguished history. It dates back to 
C. F. Gauss (1795), who developed a deterministic least-squares-error criterion as opposed to the 
stochastic least-mean-squares criterion of this chapter. Gauss’ formulation was motivated by his 
work on celestial orbits, and we shall comment on it more fully in the remarks to Part VII (Least- 
Squares Methods) when we study the least-squares criterion. A distinctive feature of the square-error 
criterion is that it penalizes large errors more than small errors. In this way, it is more sensitive to the 
presence of outliers in the data. This is in contrast, for example, to Laplace’s proposition to use the 
absolute error criterion as a performance measure (see Sheynin (1977)). Gauss was very much aware 
of the distinction between both design criteria and this is how he commented on his squared-error 
criterion in relation to Laplace’s absolute-error criterion: 


“Laplace has also considered the problem in a similar manner, but he adopted the absolute 
value of the error as the measure of this loss. Now if I am not mistaken, this convention is
no less arbitrary than mine. Should an error of double size be considered as tolerable as a 
single error twice repeated or worse? Is it better to assign only twice as much influence to a 
double error or more? The answers are not self-evident, and the problem cannot be resolved 
by mathematical proofs, but only by an arbitrary decision.” 

Extracted from the translation by Stewart (1995). 


Besides Gauss’ motivation, there are many good reasons for using the mean-square-error crite- 
rion, not the least of which is the fact that it leads to a closed-form characterization of the solution 
as a conditional mean. In addition, for Gaussian random variables, it can be argued that the least- 
mean-squares error estimator is practically optimal for any other choice of the error cost function 
(quadratic or otherwise) — see, for example, Pugachev (1958) and Zakai (1964). 


Statistical theory. There is extensive work on the least-mean-squares error criterion in the sta- 
tistical literature. For instance, the result of Thm. 1.1 on the conditional mean estimator is related 
to the so-called Rao-Blackwell theorem from statistics (see, e.g., Caines (1988) and Scharf (1991)). 
However, in statistics, there is often a distinction between the classical approach and the Bayesian ap- 
proach to estimation. In the classical approach, the unknown quantity to be estimated is modeled as a 
deterministic but unknown constant; we shall encounter this situation in Chapter 6 while studying the 
Gauss-Markov theorem. The Bayesian approach, on the other hand, models the unknown quantity as 
a random variable, which is the point of view we adopted in this chapter. Such Bayesian formulations 
allow us to incorporate prior knowledge about the unknown variable itself into the solution, such as 
information about its probability density function. This fact helps explain why Bayesian techniques 
are dominant in many successful filtering and estimation designs; still the Bayesian approach has not 
been immune to controversies along its history (see Box and Tiao (1973)). 


Complex random variables. Complex variables, as well as complex random variables, are fre- 
quent in electrical engineering (and perhaps more so than in other disciplines). One notable example 
arises in digital communications where symbols are often selected at random from a complex constel- 
lation (or even in the complex representation of bandpass signals). Since complex random variables 
will play a prominent role throughout this textbook, we have chosen to motivate them from first 
principles in the body of the chapters. In Sec. A.5 we pursue their study more closely and focus, in
particular, on the important class of complex-valued Gaussian random variables. It is explained in
that section that a certain circularity assumption needs to be satisfied if the resulting pdf in the com- 
plex case is to be uniquely determined by the first and second-order moments of the complex random 
variable, as happens in the real case. The main conclusion appears in the statement of Lemma A.1, 
which shows the form of a complex Gaussian distribution under the circularity assumption. The 
original derivation of this form is due to Wooding (1956) — see also Goodman (1963) and Miller 
(1974). It is for this reason that, in future discussions, whenever we refer to a complex Gaussian dis- 
tribution we shall often attach the qualification “circular” to it and refer instead to a circular Gaussian 
distribution. 


Linear algebra. Throughout the book, the reader will be exposed to a variety of concepts from 
linear algebra and matrix theory in a self-contained and motivated manner. In this way, after pro- 
gressing sufficiently enough into the book, students will be able to master many useful concepts. 
Several of these concepts are summarized in the background Chapter B. If additional help is needed, 
some accessible references on matrix theory are the two volumes by Gantmacher (1959), the book 
by Bellman (1970), and the two volumes by Horn and Johnson (1987,1994). Accessible references 
on linear algebra are, for example, the books by Strang (1988,1993), Lay (1994), and Lax (1997). 
Problems and Computer Projects


PROBLEMS 


Problem I.1 (Rayleigh distribution) Consider a Rayleigh-distributed random variable x with
pdf given by (A.3). Show that its mean and variance are given by (A.4).


Problem I.2 (Markov's inequality) Suppose x is a scalar nonnegative real-valued random
variable with probability density function $f_x(x)$. Show that $P[x \ge \alpha] \le E\,x / \alpha$.


Problem I.3 (Chebyshev's inequality) Consider a scalar real-valued random variable x with
mean $\bar{x}$ and variance $\sigma_x^2$. Let $y = (x - \bar{x})^2$. Apply Markov's inequality to y to establish Chebyshev's
inequality (A.5).


Problem I.4 (Conditional expectation) Consider two real-valued random variables x and y.
Establish that $E\,[E(x \mid y)] = E\,x$. That is, show that

$$\int_{S_x} x f_x(x)\, dx = \int_{S_y} \left( \int_{S_x} x f_{x|y}(x|y)\, dx \right) f_y(y)\, dy$$

where $S_x$ and $S_y$ denote the supports of the variables x and y, respectively.


Problem I.5 (Numerical example) Assume y is a random variable that is red with probability
1/3 and blue with probability 2/3. Likewise, x is a random variable that is Gaussian with mean
1 and variance 2 if y is red, and uniformly distributed between -1 and 1 if y is blue. Find the
individual and joint pdfs of {x, y}. Find also $E\,x$ and $E(x \mid y)$.


Problem I.6 (Correlation coefficient) Consider two scalar random variables {x, y} with means
$\{\bar{x}, \bar{y}\}$, variances $\{\sigma_x^2, \sigma_y^2\}$, and cross-correlation $\sigma_{xy}$. Define the correlation coefficient
$\rho_{xy} = \sigma_{xy} / (\sigma_x \sigma_y)$. Show that $|\rho_{xy}| \le 1$.


Problem I.7 (Fully correlated random variables) Consider two scalar real-valued random
variables x and y with correlation coefficient $\rho_{xy}$, means $\{\bar{x}, \bar{y}\}$, and variances $\{\sigma_x^2, \sigma_y^2\}$. Show that
$|\rho_{xy}| = 1$ if, and only if, $x - \bar{x} = \pm \dfrac{\sigma_x}{\sigma_y} (y - \bar{y})$.


Problem I.8 (Chi-square distribution) Let x be a real-valued random variable with pdf $f_x(x)$.
Define $y = x^2$.

(a) Use the fact that for any nonnegative value y, the event $\{x^2 \le y\}$ occurs whenever
$\{-\sqrt{y} \le x \le \sqrt{y}\}$ to conclude that the pdf of y is given by

$$f_y(y) = \frac{1}{2\sqrt{y}} f_x(\sqrt{y}) + \frac{1}{2\sqrt{y}} f_x(-\sqrt{y}), \quad y > 0$$

(b) Assume x is Gaussian with zero mean and unit variance. Conclude that
$f_y(y) = \dfrac{1}{\sqrt{2\pi y}}\, e^{-y/2}$ for $y > 0$.
Remark. The above pdf is known as the Chi-square distribution with one degree of freedom. A Chi-square
distribution with k degrees of freedom is characterized by the pdf:

$$f_y(y) = \frac{1}{2^{k/2}\, \Gamma(k/2)}\, y^{(k-2)/2}\, e^{-y/2}, \quad y > 0$$

where $\Gamma(\cdot)$ is the so-called Gamma function; it is defined by the integral $\Gamma(z) = \int_0^\infty s^{z-1} e^{-s}\, ds$ for
$z > 0$. The function $\Gamma(\cdot)$ has the following useful properties: $\Gamma(1/2) = \sqrt{\pi}$, $\Gamma(z + 1) = z\,\Gamma(z)$ for
any $z > 0$, and $\Gamma(n + 1) = n!$ for any integer $n \ge 0$.
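
The Gamma-function properties listed in the remark, and the Chi-square pdf itself, can be spot-checked with a few lines of Python. This is a minimal sketch using only the standard `math` module; the function name `chi_square_pdf` is hypothetical.

```python
import math


def chi_square_pdf(y, k):
    """Chi-square pdf with k degrees of freedom, evaluated at y (zero for y <= 0)."""
    if y <= 0:
        return 0.0
    return y ** ((k - 2) / 2) * math.exp(-y / 2) / (2 ** (k / 2) * math.gamma(k / 2))


# Properties of the Gamma function quoted in the remark.
print(math.gamma(0.5), math.sqrt(math.pi))       # Gamma(1/2) and sqrt(pi)
print(math.gamma(3.5), 2.5 * math.gamma(2.5))    # Gamma(z + 1) and z * Gamma(z)
print(math.gamma(5), math.factorial(4))          # Gamma(n + 1) and n!

# Special cases: one and two degrees of freedom.
print(chi_square_pdf(1.0, 1), math.exp(-0.5) / math.sqrt(2 * math.pi))
print(chi_square_pdf(3.0, 2), 0.5 * math.exp(-1.5))
```

Each printed pair should agree up to rounding; the k = 1 case matches the pdf of part (b) of Prob. I.8, and the k = 2 case reduces to an exponential pdf.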


Problem I.9 (Chi and Rayleigh distributions) Consider an FIR channel with two real-valued
taps, x(1) and x(2). The taps are assumed to be independent zero-mean unit-variance Gaussian
random variables.
(a) Use the result of part (b) of Prob. I.8 to show that the random variable $w = x^2(1) + x^2(2)$
has a Chi-square distribution with two degrees of freedom, i.e., $f_w(w) = \frac{1}{2} e^{-w/2}$ for $w \ge 0$.
(b) Conclude that the random variable $z = \sqrt{x^2(1) + x^2(2)}$ has a Rayleigh distribution, namely,
show that $f_z(z) = z\, e^{-z^2/2}$ for $z \ge 0$, with $\bar{z} = \sqrt{\pi/2}$ and $\sigma_z^2 = (2 - \pi/2)$.


Problem I.10 (Uniform noise) Consider Ex. 1.2 but assume now that the noise is uniformly
distributed between -1/2 and 1/2. Show that $\hat{x} = 1$ when $y \in [1/2, 3/2]$ and $\hat{x} = -1$ when
$y \in [-3/2, -1/2]$.

Problem I.11 (Estimator for a binary signal) Consider the same setting of Ex. 1.2 but assume
now that the noise v has a generic variance $\sigma_v^2$.


(a) Show that the optimal least-mean-squares estimator of x given y is $\hat{x} = \tanh(y / \sigma_v^2)$. Plot
the estimate $\hat{x}$ as a function of y for the values $\sigma_v^2 = 0.5, 1, 2$.


(b) Argue that $\hat{x} = \mathrm{sign}(y)$ can be chosen as a suboptimal estimator.


Problem I.12 (Biased measurements) Consider the same setting of Ex. 1.2 but assume now
that the noise v has mean $\bar{v}$ and unit variance.


(a) Show that the optimal least-mean-squares estimator of x given y is $\hat{x} = \tanh(y - \bar{v})$. Plot
the estimate $\hat{x}$ as a function of y for the values $\bar{v} = -0.5, 0, 0.5$.


(b) Argue that $\hat{x} = \mathrm{sign}(y - \bar{v})$ can be chosen as a suboptimal estimator.


Problem I.13 (BPSK signal) Consider noisy observations y(i) = x + v(i), where x and v(i) are
independent real-valued random variables, v(i) is a white-noise Gaussian random process with zero
mean and variance $\sigma_v^2$, and x takes the values ±1 with equal probability. The value of x is either +1
or -1 for all measurements {y(i)}. The whiteness assumption on v(i) means that $E\,v(i)v(j) = 0$
for $i \ne j$.

(a) Show that the least-mean-squares estimate of x given {y(0), ..., y(N - 1)} is

$$\hat{x}_N = \tanh\left[ \frac{1}{\sigma_v^2} \sum_{i=0}^{N-1} y(i) \right]$$

(b) Assume x takes the value 1 with probability p and the value -1 with probability 1 - p. Show
that the least-mean-squares estimate of x given {y(0), ..., y(N - 1)} is

$$\hat{x}_N = \tanh\left[ \frac{1}{\sigma_v^2} \sum_{i=0}^{N-1} y(i) + \frac{1}{2} \ln\left( \frac{p}{1-p} \right) \right]$$

in terms of the natural logarithm of p/(1 - p).

(c) Assume the noise is correlated. Let $R_v = E\,vv^*$ where v = col{v(0), v(1), ..., v(N - 1)}.
Show that the least-mean-squares estimate of x given {y(0), ..., y(N - 1)} is now

$$\hat{x}_N = \tanh\left[ y^T R_v^{-1} h + \frac{1}{2} \ln\left( \frac{p}{1-p} \right) \right]$$

where y = col{y(0), y(1), ..., y(N - 1)} and h = col{1, 1, ..., 1}.


Problem I.14 (Interfering signals) Two random variables $x_1$ and $x_2$ are transmitted over known
channels with gains $\alpha_1$ and $\alpha_2$, as indicated in Fig. I.1. The channels simply scale the random
variables by $\{\alpha_1, \alpha_2\}$. Zero-mean Gaussian noise is added to each transmission and the result is
$y = \alpha_1 x_1 + \alpha_2 x_2 + v_1 + v_2$, where $v_1$ and $v_2$ are independent of each other and of $\{x_1, x_2\}$.
The variances of the Gaussian noises $v_1$ and $v_2$ are $\sigma_{v_1}^2$ and $\sigma_{v_2}^2$, respectively.

FIGURE I.1 Two interfering signals transmitted over flat channels.

The random variable $x_1$ assumes the value +1 with probability p and the value -1 with probability
1 - p. The random variable $x_2$ is distributed as follows:

$$
\text{if } x_1 = +1 \text{ then } x_2 =
\begin{cases} +2 & \text{with probability } q \\ -2 & \text{with probability } 1 - q \end{cases}
$$

$$
\text{if } x_1 = -1 \text{ then } x_2 =
\begin{cases} +3 & \text{with probability } r \\ -3 & \text{with probability } 1 - r \end{cases}
$$

(a) Find the means of $x_1$, $x_2$, and y.
(b) Find the minimum-mean-square-error estimator of $x_2$ given y.
(c) Specialize the results to the case $\alpha_1 = \alpha_2 = 0$, $\sigma_{v_1}^2 = \sigma_{v_2}^2 = 1/4$.


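Problem I.15 (Optimal receiver) Complete the derivation of Ex. 2.2.
(a) Verify that

$$\hat{x} = \frac{1}{f_y(y)} \sum_{k=0}^{3} \alpha(k)\, m_k\, f_v(y - Hm_k)$$

where $\alpha(0) = p^2$, $\alpha(1) = q^2$ and $\alpha(2) = \alpha(3) = pq$.
(b) Let

$$
\begin{aligned}
a &= p^2 \cdot e^{-\frac{1}{2}\left(-2y(0) - 3y(1) + \frac{13}{4}\right)}, &
b &= q^2 \cdot e^{-\frac{1}{2}\left(2y(0) + 3y(1) + \frac{13}{4}\right)}, \\
c &= pq \cdot e^{-\frac{1}{2}\left(-2y(0) + y(1) + \frac{5}{4}\right)}, &
d &= pq \cdot e^{-\frac{1}{2}\left(2y(0) - y(1) + \frac{5}{4}\right)}
\end{aligned}
$$

Show that the expression in part (a) leads to

$$
\begin{bmatrix} \hat{s}(0) \\ \hat{s}(1) \end{bmatrix}
= \frac{1}{a + b + c + d} \begin{bmatrix} a - b + c - d \\ a - b - c + d \end{bmatrix}
$$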

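Problem I.16 (Exponential distribution) Suppose we observe y = x + v, where x and v are
independent real-valued random variables with exponential distributions with parameters $\lambda_1$ and $\lambda_2$
($\lambda_1 \ne \lambda_2$). That is, the pdfs of x and v are $f_x(x) = \lambda_1 e^{-\lambda_1 x}$ for $x \ge 0$ and $f_v(v) = \lambda_2 e^{-\lambda_2 v}$ for
$v \ge 0$, respectively.
(a) Using the fact that the pdf of the sum of two independent random variables is the convolution
of the individual pdfs, show that

$$f_y(y) = \frac{\lambda_1 \lambda_2}{\lambda_2 - \lambda_1}\, e^{-\lambda_2 y} \left[ e^{(\lambda_2 - \lambda_1) y} - 1 \right], \quad y \ge 0$$

(b) Establish that $f_{x,y}(x, y) = \lambda_1 \lambda_2\, e^{(\lambda_2 - \lambda_1) x - \lambda_2 y}$ for $x \ge 0$ and $y \ge 0$.
(c) Show that the least-mean-squares estimate of x given an observed value y is

$$\hat{x} = \frac{1}{\lambda_1 - \lambda_2} - \frac{e^{-\lambda_1 y}}{e^{-\lambda_2 y} - e^{-\lambda_1 y}}\; y$$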


Problem I.17 (Equivalent criterion) Show that the least-mean-squares estimator $\hat{x} = E(x \mid y)$
also minimizes $E\,\tilde{x}^* W \tilde{x}$, for any Hermitian nonnegative-definite matrix W.


Problem I.18 (Second- and fourth-order moments) Consider an M x M positive-definite
matrix R and introduce its eigen-decomposition (cf. Sec. B.1), $R = \sum_{i=1}^{M} \lambda_i u_i u_i^*$, where the $\lambda_i$
are the eigenvalues of R (all positive) and the $u_i$ are the eigenvectors of R. The $u_i$ are orthonormal,
i.e., $u_i^* u_j = 0$ for all $i \ne j$ and $u_i^* u_i = 1$. Let h be a random vector with probability distribution
$P(h = u_i) = \lambda_i / \mathrm{Tr}(R)$.

(a) Show that $E\,hh^* = R / \mathrm{Tr}(R)$ and $E\,hh^*hh^* = R / \mathrm{Tr}(R)$.


(b) Show that $E\,h^* R^{-1} h = M / \mathrm{Tr}(R)$ and $E\,hh^* R^{-1} hh^* = I / \mathrm{Tr}(R)$, where I is the identity
matrix.


(c) Show that $E\,h^*h = 1$ and $E\,h = \dfrac{1}{\mathrm{Tr}(R)} \sum_{i=1}^{M} \lambda_i u_i$.

Problem I.19 (Independent and Gaussian variables) Consider two independent and
zero-mean random variables {u, w}, where u is a row vector and w is a column vector; both are
M-dimensional. The covariance matrices of u and w are defined by $E\,u^*u = \sigma_u^2 I$ and $E\,ww^* = C$.
In addition, u is assumed to be circular Gaussian. Let $e_a = uw$.


(a) Show that $E\,|e_a|^2 = \sigma_u^2\, \mathrm{Tr}(C)$.


(b) Use the result of Lemma A.3 to show that $E\,\|u\|^2 \cdot |e_a|^2 = (M + 1)\, \sigma_u^4\, \mathrm{Tr}(C)$, where the
notation $\|\cdot\|$ denotes the Euclidean norm of its argument.


Problem I.20 (Fourth-moment) Assume u is a circular Gaussian random row vector with a
diagonal covariance matrix $\Lambda$. Define $z = \|u\|^2$. What is the variance of z?


Problem I.21 (Covariance equation) Consider two column vectors {w, z} that are related via
$z = w + \mu u^*(d - uw)$, where u is a circular Gaussian random variable with a diagonal covariance
matrix, $E\,u^*u = \Lambda$ (u is a row vector). Moreover, $\mu$ is a positive constant and $d = uw^o + v$,
for some constant vector $w^o$ and random scalar v with variance $\sigma_v^2$. The variables {v, u, w} are
independent of each other. Define $e_a = u(w^o - w)$, as well as the error vectors $\tilde{z} = w^o - z$ and
$\tilde{w} = w^o - w$, and denote their covariances by $\{R_{\tilde{z}}, R_{\tilde{w}}\}$. Assume $E\,z = E\,w = w^o$, while all
other random variables are zero-mean.


(a) Verify that $\tilde{z} = \tilde{w} - \mu u^*(e_a + v)$.
(b) Use the result of Lemma A.3 to show that

$$R_{\tilde{z}} = R_{\tilde{w}} - \mu R_{\tilde{w}} \Lambda - \mu \Lambda R_{\tilde{w}} + \mu^2 \left( \Lambda\, \mathrm{Tr}(R_{\tilde{w}} \Lambda) + \Lambda R_{\tilde{w}} \Lambda \right) + \mu^2 \sigma_v^2 \Lambda$$

(c) How would the result of part (b) change if u were real-valued Gaussian and all other variables
were also real-valued?
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COMPUTER PROJECT 


Project 1.1 (Comparing optimal and suboptimal estimators) The purpose of this project’ 
is to compare the performance of an optimal least-mean-squares estimator with three approximations 
for it, along the lines discussed in Ex. 1.2. Thus consider the setting of Prob. 1.13. 


(a) 


(b) 


(d) 


Write a MATLAB program that generates a BPSK random variable æ that is equal to +1 with 
probability p and to —1 with probability 1 — p. 

Simulate the estimator of part (b) of Prob. I.13 for different number of observations N. Gener- 
ate observations (y(i)) and plot w as a function of N for1 < N < 10, with all observations 
assumed generated by the same value of z — either --1 or —1, and using zero-mean Gaussian 
noise with unit variance. Plot 2x for the cases p = 0.1, 0.3, 0.5, 0.8. 


Compare the performance of the optimal estimate 2w with the averaged estimate 
ERU 
Env = N > y(i) 


for several values of N, say, for 1 < N < 300, and for the same values of p in part (b). 
Does it take many more samples NV for the averaged estimate Êx ,av to provide a good result 
compared with the optimal nonlinear estimate Zn? 


Fix p = 1/2, and define the nonlinear decision device: 


dd» Cig eh 
Nee Wien 


Consider also the alternative (sign-of-optimal) estimate Zaec = sign [ĉn]. It is clear that 
Zdec assumes the values +1, whereas the optimal estimate ĉn does not. Is Zaec a better 
estimate than 2x? The answer in the mean-square sense is of course negative since we already 
know that 2x is the best estimate. To verify this fact do the following. Fix the number of 
observations at N = 10. Then perform 1000 experiments, with each experiment 7 resulting in 
an optimal estimate Z10(7) and an estimate £aec(1). For each estimate, the value of z is fixed 
at either +1 or — 1. Compute the sample variances 


j 1000 -— | 1000 mM 
7005 2, i" - êl, 1095 2. ^ - «| 
Which one is smaller? Repeat for the following (sign-of-average) estimate: 
sign = Sign [£ Nav] 


That is, apply the decision device to the estimate that is obtained from averaging. 


² MATLAB programs that solve all computer projects in the book can be downloaded for free by all readers
(including students and instructors) from the publisher's or author's websites. The purpose of these programs is
to allow readers to experiment freely with the concepts covered in the chapters. The readers may also download 
extensive commentary and typical performance plots for all computer projects from the same websites. Detailed 
solutions also appear in the textbook by Sayed (2003).

Chapter 3: Normal Equations
Chapter 4: Orthogonality Principle 
Chapter 5: Linear Models 
Chapter 6: Constrained Estimation 
Chapter 7: Kalman Filter 


Summary and Notes 


Problems and Computer Projects 


CHAPTER 3 


Normal Equations 


In Secs. 1.2 and 2.1 we studied the problem of determining the optimal function h(·) that
minimizes the mean-square error of estimating a random variable x from another random
variable y. Specifically, we solved

$$\min_{h(\cdot)} \; E\,\tilde{x}^*\tilde{x} \qquad (3.1)$$

over all functions h(·) of y, where $\tilde{x} = x - h(y)$ is the estimation error. The optimal solution
was found to be the conditional expectation of x given y, i.e.,

$$\hat{x} = E\,(x \mid y)$$


Such conditional expectations are generally hard to evaluate in closed-form, except in some 
special cases. We encountered three such cases in Part I (Optimal Estimation), namely the 
case of a BPSK signal embedded in additive Gaussian noise (Ex. 1.2), the case of jointly 
Gaussian random variables (studied in Secs. 1.4 and 2.2), and the case of random variables 
with exponential distributions (studied in Prob. I.16). Due to the difficulty in evaluating 
E (x|y) in general, it is common practice to restrict the choice of h(·) to the subclass of
affine functions of y, i.e., to functions of the form 


h(y) = Ky +b (3.2) 


for some matrix K and for some vector b to be determined. For general vector-valued
variables {x, y}, K is a matrix and b is a vector. When {x, y} are scalars, then {K, b}
will also be scalars. If x is a scalar and y is a vector, then K will be a row vector and
b a scalar. In some applications, the variables {x, y} may be matrix-valued, in which
case {K, b} will be matrix-valued. In other words, the dimensions of {K, b} need to be
consistent with the dimensions of {x, y}, and these dimensions will be obvious from the
context.

Affine functions of the form (3.2) are easier to implement than most nonlinear functions.
For example, when K is a row vector, the product Ky amounts to an inner product.
Moreover, by deliberately restricting h(·) to be an affine function of y, the evaluation of the
optimal h(·) is greatly simplified. In particular, it will be seen that all we need to know in
order to determine the optimal parameters {K, b} are the first and second-order moments
of {x, y}, namely, $E\,x$, $E\,y$, $E\,xx^*$, $E\,yy^*$, and $E\,xy^*$. No other moments are needed. In
contrast, evaluation of the optimal estimator, $\hat{x} = E\,(x \mid y)$, requires full knowledge of the
conditional pdf $f_{x|y}(x \mid y)$.

Of course, the minimum mean-square error (m.m.s.e.) that results from using the affine
estimator (3.2) will be larger than the m.m.s.e. that results from the optimal design (3.1).
One notable exception is the case of jointly Gaussian random variables {x, y}. It was
verified in Sec. 2.2 that for Gaussian random variables, the optimal estimator is an affine
function of the observation, so that, in this case, the affine and optimal estimators achieve
the same m.m.s.e. (see Lemma 2.1). Still, in general and for other signal distributions,
the performance of the affine estimator (3.2) is reasonable enough for many important
applications, and this fact explains its widespread adoption in current practice.

In this chapter we shall study the linear (affine) estimation problem in some detail. Be- 
fore proceeding, we would like to remind the reader that the presentation in Part I (Optimal 
Estimation) has shown that all results for scalar real-valued random variables could be ex- 
tended rather immediately to vector and also complex-valued random variables by relying 
on the vector and complex conjugation notation. For this reason, we shall deal directly 
with the vector and complex-valued case in the sequel. 


3.1 MEAN-SQUARE ERROR CRITERION 


In order to simplify the presentation, we assume first that {x, y} are zero-mean random
variables. Later in Sec. 4.4 we show how the nonzero-mean case can be accommodated 
through the process of centering the random variables. 

Thus, consider two zero-mean vector-valued random variables x and y and let

$$E\,x = 0, \quad E\,y = 0, \quad R_x \triangleq E\,xx^*, \quad R_y \triangleq E\,yy^*, \quad R_{xy} \triangleq E\,xy^* = R_{yx}^*$$

The dimensions of x and y need not be identical, say, x is p × 1 and y is q × 1. The analysis
can be adjusted to treat the case of matrix-valued data as well, say, where x is p × r and
y is q × s with r and s larger than one. However, it is common (and also sufficient) to
illustrate the main ideas by focusing on the case of column vectors x and y. In this case,
$\{R_x, R_y, R_{xy}\}$ are {p × p, q × q, p × q} matrices, respectively.

We now seek an affine estimator for x, namely, one of the form

$$\hat{x} = Ky + b$$

for some constants {K, b} to be determined, where K is p × q and b is p × 1. The determination
of {K, b} is based on two considerations. First, the estimator should be unbiased,
which means that it should satisfy

$$E\,\hat{x} = E\,x = 0$$

But since

$$E\,\hat{x} = K\,E\,y + b = 0 + b = b$$

we find that we must have b = 0. This means that the estimator that we are seeking is
effectively a linear estimator, i.e., it is one of the form $\hat{x} = Ky$. Second, the coefficient
matrix K should be chosen optimally so as to minimize the error covariance matrix (or its
trace), as we now explain.

Let $\{k_i^*\}$ denote the individual rows of K. Then the estimator for each entry of x, say,
x(i), is given by the inner product $k_i^* y$,

$$\begin{bmatrix} \hat{x}(0) \\ \hat{x}(1) \\ \vdots \\ \hat{x}(p-1) \end{bmatrix} = \begin{bmatrix} k_0^* y \\ k_1^* y \\ \vdots \\ k_{p-1}^* y \end{bmatrix} = \begin{bmatrix} k_0^* \\ k_1^* \\ \vdots \\ k_{p-1}^* \end{bmatrix} y$$

The optimal choices for the column vectors $\{k_i\}$ are determined by solving

$$\min_{k_i} \; E\,|\tilde{x}(i)|^2 \quad \text{for each } i = 0, 1, \ldots, p-1 \qquad (3.3)$$

where

$$\tilde{x}(i) = x(i) - \hat{x}(i)$$

is the estimation error for the i-th entry of x. That is, each column vector $k_i$ is determined
by minimizing the error variance of the corresponding entry in x. The optimization problems
(3.3) can be grouped together and stated equivalently as the problem of determining
the matrix K by solving

$$\min_{K} \; E\,\tilde{x}^*\tilde{x} \qquad (3.4)$$

where $\tilde{x} = x - \hat{x}$. This is because the scalar quantity $E\,\tilde{x}^*\tilde{x}$ in (3.4) is simply the sum of
the individual error variances that appear in (3.3),

$$E\,\tilde{x}^*\tilde{x} = E\,|\tilde{x}(0)|^2 + E\,|\tilde{x}(1)|^2 + \ldots + E\,|\tilde{x}(p-1)|^2 \qquad (3.5)$$

and each term $E\,|\tilde{x}(i)|^2$ depends on the corresponding $k_i$ alone. In this way, minimizing
$E\,\tilde{x}^*\tilde{x}$ over K is equivalent to minimizing each term $E\,|\tilde{x}(i)|^2$ over its $k_i$, so that problems
(3.3) and (3.4) are equivalent. We now proceed slowly and show how to solve (3.3) over the
corresponding vector arguments $\{k_i\}$. Later in Sec. 3.4, we shall show how to minimize
$E\,\tilde{x}^*\tilde{x}$ directly over the matrix argument K, rather than in steps over the individual $\{k_i\}$.

Therefore, continuing with (3.3), we expand the cost function to obtain a quadratic
expression in the unknown column vector $k_i$,

$$\begin{aligned}
E\,|\tilde{x}(i)|^2 &= E\,|x(i) - k_i^* y|^2 \\
&= E\,[x(i) - k_i^* y][x(i) - k_i^* y]^* \\
&= E\,|x(i)|^2 - [E\,x(i)y^*]\,k_i - k_i^*\,[E\,y x^*(i)] + k_i^* R_y k_i
\end{aligned}$$

This is a scalar-valued cost function of a possibly complex-valued vector quantity, $k_i$. We
denote it by

$$J(k_i) \triangleq E\,|x(i)|^2 - [E\,x(i)y^*]\,k_i - k_i^*\,[E\,y x^*(i)] + k_i^* R_y k_i \qquad (3.6)$$


The quantity $E\,|x(i)|^2$ that appears in the expression for $J(k_i)$ is the variance of x(i) and
is therefore equal to the i-th diagonal entry of $R_x$. We denote it by

$$\sigma_{x,i}^2 \triangleq E\,|x(i)|^2$$

Likewise, the quantity $E\,x(i)y^*$ is the i-th row of the cross-covariance matrix $R_{xy}$. We
denote it by $R_{xy,i} = E\,x(i)y^*$. In this way, we can rewrite $J(k_i)$ as

$$J(k_i) = \sigma_{x,i}^2 - R_{xy,i}k_i - k_i^* R_{yx,i} + k_i^* R_y k_i \qquad (3.7)$$

where $\{\sigma_{x,i}^2, R_{xy,i}, R_y\}$ are known quantities and $k_i$ is the unknown column vector; $\sigma_{x,i}^2$
is a scalar, $R_{xy,i}$ is a row vector, and $R_y$ is a nonnegative-definite matrix. Moreover,
$R_{yx,i} = R_{xy,i}^*$. Our objective is to minimize $J(k_i)$ over $k_i$.

We can proceed in at least two different ways. One way relies on standard differentiation
techniques from calculus, while the second way relies on a completion-of-squares
argument. While it may be easier for the reader to assimilate the differentiation argument
due to familiarity with the concept of derivatives and function minimization, the second 
argument on completion of squares will serve as a powerful example of the convenience 
of the vector notation. Since this argument will be useful in other scenarios as well, we 
choose to include it here for the benefit of the reader, in addition to the differentiation ar- 
gument. Still, both arguments require careful reasoning, as we explain below. Later, when 
similar derivations are needed in other chapters, we shall move more swiftly through the 
presentation. 


3.2 MINIMIZATION BY DIFFERENTIATION 


We start with the differentiation argument. As the reader may recall from a basic course on
calculus, in order to determine the vector $k_i$ that minimizes $J(k_i)$, we need to differentiate
$J(k_i)$ with respect to each one of the entries of $k_i$, namely, $\{k_{ij},\ j = 0, 1, \ldots, q-1\}$, and set
the derivatives equal to zero. The main complication that arises here, relative to what the
reader may be familiar with from a standard course on calculus, is that the entries $\{k_{ij}\}$ are
complex-valued, and we therefore need to explain what is meant by differentiating $J(k_i)$
with respect to a complex variable. This is done in Chapter C, where we show how to
differentiate a function with respect to a complex scalar and even a complex vector. When
the rules derived there are applied to the cost function $J(k_i)$ in (3.7) we find
that the complex gradient vector of $J(k_i)$ with respect to $k_i$ is given by

$$\nabla_{k_i} J(k_i) = -R_{xy,i} + k_i^* R_y$$

Observe that this result is consistent with what we would expect from the standard rules of
differentiation for functions of real variables, with the vectors $k_i$ and $k_i^*$ treated as different
quantities. Also, if all data were real-valued, in which case

$$J(k_i) = E\,x^2(i) - [E\,x(i)y^T]\,k_i - k_i^T\,[E\,y x(i)] + k_i^T R_y k_i$$

then we would have obtained instead $\nabla_{k_i} J(k_i) = -2R_{xy,i} + 2k_i^T R_y$, with an additional
factor of 2 (see Chapter C).

By setting the complex gradient equal to zero at the optimal choice $k_i = k_i^o$, we find
that $k_i^o$ should satisfy the linear equations

$$k_i^{o*} R_y = R_{xy,i}, \qquad i = 0, 1, \ldots, p-1 \qquad (3.8)$$

The vector $k_i^o$ so obtained minimizes $J(k_i)$ since the Hessian matrix of $J(k_i)$ with respect
to $k_i$ is equal to $R_y$, which is a nonnegative-definite matrix. Hessian matrices are defined
in Chapter C; they are obtained by further differentiating $\nabla_{k_i} J(k_i)$ with respect to $k_i^*$.
If we collect the row vectors $\{k_i^{o*}\}$ from (3.8) into a matrix $K_o$ we find that this desired
solution matrix should satisfy

$$K_o R_y = R_{xy} \qquad (3.9)$$


These equations are called the normal equations, for reasons explained later in Remark 4.1 
in the next chapter. 
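As a rough illustration of how the normal equations (3.9) might be solved numerically, the following Python sketch applies Gauss-Jordan elimination with exact rational arithmetic. It assumes real-valued moments and an invertible $R_y$; the function name `solve_normal_equations` and the numerical values of the moments are hypothetical and chosen only for illustration.

```python
from fractions import Fraction

def solve_normal_equations(R_xy, R_y):
    """Solve K R_y = R_xy for K (real rational data, R_y assumed invertible)."""
    q = len(R_y)
    # K R_y = R_xy is equivalent to R_y^T K^T = R_xy^T, so we reduce [R_y^T | R_xy^T].
    aug = [[Fraction(R_y[j][i]) for j in range(q)] + [Fraction(row[i]) for row in R_xy]
           for i in range(q)]
    for col in range(q):
        pivot = next((r for r in range(col, q) if aug[r][col] != 0), None)
        if pivot is None:
            raise ValueError("R_y is singular; the solution is not unique")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(q):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    # After the reduction, column q + i of the augmented matrix holds row i of K.
    return [[aug[j][q + i] for j in range(q)] for i in range(len(R_xy))]

# Hypothetical second-order moments (x scalar, y of size 2 x 1).
R_y = [[4, 1], [1, 3]]
R_xy = [[2, 1]]
K_o = solve_normal_equations(R_xy, R_y)
print(K_o)
```

Exact fractions are used here only so that the result can be checked by hand; a floating-point solver would serve equally well in practice.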


3.3 MINIMIZATION BY COMPLETION OF SQUARES 


We now re-establish (3.9), from first principles, by using a completion-of-squares argument
that avoids the need for dealing with complex gradients. Thus, consider the cost function
(3.7) and note that it can be expressed in matrix form as follows:

$$J(k_i) = \begin{bmatrix} 1 & k_i^* \end{bmatrix} \begin{bmatrix} \sigma_{x,i}^2 & -R_{xy,i} \\ -R_{yx,i} & R_y \end{bmatrix} \begin{bmatrix} 1 \\ k_i \end{bmatrix} \qquad (3.10)$$

with a Hermitian center matrix and with the unknown vector $k_i$, and its conjugate transpose,
multiplying from both sides.
Now given any Hermitian matrix of the form

$$M = \begin{bmatrix} A & B \\ B^* & C \end{bmatrix}$$

with $A = A^*$, $C = C^*$, and C invertible, it can be verified by direct calculation that
M can be factored into a product of block upper-triangular, block-diagonal, and block
lower-triangular matrices as:

$$\begin{bmatrix} A & B \\ B^* & C \end{bmatrix} = \begin{bmatrix} I & BC^{-1} \\ 0 & I \end{bmatrix} \begin{bmatrix} \Sigma & 0 \\ 0 & C \end{bmatrix} \begin{bmatrix} I & 0 \\ C^{-1}B^* & I \end{bmatrix} \qquad (3.11)$$

where

$$\Sigma \triangleq A - BC^{-1}B^*$$

is called the Schur complement of M with respect to C. This is a general matrix result; it
is an immediate extension to Hermitian matrices of a similar factorization result that was
described in a footnote in Sec. 1.4 for symmetric matrices. The validity of (3.11) can be
verified by expanding the right-hand side. The factorization (3.11) for M is valid as long
as the matrix C is invertible, which means that we cannot apply the result directly to the
center matrix

$$\begin{bmatrix} \sigma_{x,i}^2 & -R_{xy,i} \\ -R_{yx,i} & R_y \end{bmatrix}$$

that appears in our expression (3.10) for $J(k_i)$. The reason is that the covariance matrix
$R_y$ is only required to be nonnegative-definite (and, hence, possibly singular). However,
there is a generalization of (3.11) for block matrices M with possibly singular matrices C.
Indeed, it is easy to verify, also by direct calculation, that we can alternatively factor any
such M as

$$\begin{bmatrix} A & B \\ B^* & C \end{bmatrix} = \begin{bmatrix} I & D \\ 0 & I \end{bmatrix} \begin{bmatrix} \Sigma & 0 \\ 0 & C \end{bmatrix} \begin{bmatrix} I & 0 \\ D^* & I \end{bmatrix} \qquad (3.12)$$

where D is defined as any solution to the linear system of equations

$$DC = B \qquad (3.13)$$

and

$$\Sigma = A - BD^*$$

Clearly, when C is invertible, the above factorization reduces to (3.11). However, when C
is singular, many solutions D may exist for (3.13) and, consequently, many factorizations
of the form (3.12) may be possible for M. Our arguments will show that for the matrix
M in question, namely, the center matrix in (3.10), a factorization of the form (3.12) always
exists since the corresponding equations (3.13) will always have a solution D (see
Sec. 4.3).


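Applying (3.12) to the center matrix in (3.10) we can write

$$\begin{bmatrix} \sigma_{x,i}^2 & -R_{xy,i} \\ -R_{yx,i} & R_y \end{bmatrix} = \begin{bmatrix} 1 & -k_i^{o*} \\ 0 & I \end{bmatrix} \begin{bmatrix} \sigma_{x,i}^2 - R_{xy,i}k_i^o & 0 \\ 0 & R_y \end{bmatrix} \begin{bmatrix} 1 & 0 \\ -k_i^o & I \end{bmatrix} \qquad (3.14)$$

where $k_i^o$ is any solution to the linear system of equations

$$k_i^{o*} R_y = R_{xy,i} \qquad (3.15)$$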


Substituting the factorization (3.14) into expression (3.10) for $J(k_i)$ and expanding the
right-hand side, we find that $J(k_i)$ can be expressed in the equivalent form

$$J(k_i) = (\sigma_{x,i}^2 - R_{xy,i}k_i^o) + (k_i - k_i^o)^* R_y (k_i - k_i^o) \qquad (3.16)$$

This is a revealing form for $J(k_i)$ since only the second term depends on the unknown
$k_i$. But since $R_y$ is nonnegative-definite, this second term is always nonnegative and it
will be minimized (actually made equal to zero) by choosing $k_i = k_i^o$ (see Prob. II.5). We
conclude that the minimizing $k_i$ is given by any solution to (3.15). In addition, the resulting
m.m.s.e. is given by any of the following equivalent forms:

$$J(k_i^o) = \sigma_{x,i}^2 - R_{xy,i}k_i^o = \sigma_{x,i}^2 - k_i^{o*}R_y k_i^o = \sigma_{x,i}^2 - k_i^{o*}R_{yx,i} \qquad (3.17)$$

If we again collect the row vectors $\{k_i^{o*}\}$ into a matrix $K_o$ we find that $K_o$ should satisfy
(3.9), namely, $K_o R_y = R_{xy}$. Moreover, the minimum value of the related cost (3.5) can
be seen to be (see Probs. II.5 and II.6):

$$\min_K \; E\,\tilde{x}^*\tilde{x} = \mathrm{Tr}(R_x - K_o R_y K_o^*) \qquad (3.18)$$
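The completed-square form (3.16) can be checked numerically. The sketch below is a minimal Python illustration for real-valued data with a 2 × 1 observation vector; the second-order moments are hypothetical values chosen so that the arithmetic is exact, and the helper names (`quad`, `cost`, `completed_square`) are not from the text.

```python
from fractions import Fraction as F

def quad(u, M, v):
    """Return u^T M v for lists u, v and a matrix M given as a list of rows."""
    return sum(u[i] * M[i][j] * v[j] for i in range(len(u)) for j in range(len(v)))

def cost(k, var_x, r_xy, R_y):
    """J(k) of (3.7) for real data, where the two cross terms coincide."""
    cross = sum(a * b for a, b in zip(r_xy, k))
    return var_x - 2 * cross + quad(k, R_y, k)

# Hypothetical second-order moments (not taken from the text).
var_x = F(1)
r_xy = [F(2, 5), F(1, 5)]
R_y = [[F(2), F(1)], [F(1), F(2)]]

# k_o solves k_o^T R_y = r_xy; for a symmetric 2 x 2 matrix Cramer's rule suffices.
det = R_y[0][0] * R_y[1][1] - R_y[0][1] * R_y[1][0]
k_o = [(r_xy[0] * R_y[1][1] - r_xy[1] * R_y[1][0]) / det,
       (r_xy[1] * R_y[0][0] - r_xy[0] * R_y[0][1]) / det]

def completed_square(k):
    """Right-hand side of (3.16): the minimum cost plus a nonnegative quadratic term."""
    d = [a - b for a, b in zip(k, k_o)]
    return (var_x - sum(a * b for a, b in zip(r_xy, k_o))) + quad(d, R_y, d)

for k in ([F(0), F(0)], [F(1), F(-1)], [F(3, 7), F(2)]):
    print(k, cost(k, var_x, r_xy, R_y), completed_square(k))
```

For every trial vector the two printed costs agree, and neither falls below the value attained at `k_o`.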


3.4 MINIMIZATION OF THE ERROR COVARIANCE MATRIX 


The same completion-of-squares argument can be applied directly to the solution of (3.4)
rather than the solution of the individual problems (3.3). To see this, let J(K) denote the
error-covariance matrix,

$$J(K) \triangleq E\,\tilde{x}\tilde{x}^* = E\,(x - Ky)(x - Ky)^*$$

so that, as in (3.10),

$$J(K) = \begin{bmatrix} I & K \end{bmatrix} \begin{bmatrix} R_x & -R_{xy} \\ -R_{yx} & R_y \end{bmatrix} \begin{bmatrix} I \\ K^* \end{bmatrix}$$

Following the same arguments as in the previous section, we can factor the center matrix
as

$$\begin{bmatrix} R_x & -R_{xy} \\ -R_{yx} & R_y \end{bmatrix} = \begin{bmatrix} I & -K_o \\ 0 & I \end{bmatrix} \begin{bmatrix} R_x - R_{xy}K_o^* & 0 \\ 0 & R_y \end{bmatrix} \begin{bmatrix} I & 0 \\ -K_o^* & I \end{bmatrix}$$

where $K_o$ is any solution to $K_o R_y = R_{xy}$. It then follows that

$$J(K) = (R_x - R_{xy}K_o^*) + (K - K_o)R_y(K - K_o)^* \qquad (3.19)$$

where the last term is nonnegative-definite for any K since $R_y \ge 0$. Now the criterion in
(3.4) is related to J(K) via $E\,\tilde{x}^*\tilde{x} = \mathrm{Tr}[J(K)]$ so that, from (3.19),

$$E\,\tilde{x}^*\tilde{x} \ge \mathrm{Tr}(R_x - R_{xy}K_o^*) \quad \text{for any } K \qquad (3.20)$$

This is because the trace of the nonnegative-definite matrix $(K - K_o)R_y(K - K_o)^*$ is
nonnegative. This result follows from the fact that the trace of a matrix, which is the sum
of its diagonal elements, is also equal to the sum of its eigenvalues (see Prob. II.3). Now
since the eigenvalues of a nonnegative-definite matrix are necessarily nonnegative, it follows that
$\mathrm{Tr}[(K - K_o)R_y(K - K_o)^*] \ge 0$. Equality in (3.20) is achieved by setting $K = K_o$ in
(3.19).

This derivation reveals another aspect of the solution $K_o$. It not only minimizes the
trace of the error covariance matrix, as in (3.4), but it also minimizes the error-covariance
matrix itself since we also get $J(K) \ge J(K_o)$ for any K. We can therefore interpret $K_o$
as also the solution to the problem

$$\min_K \; E\,\tilde{x}\tilde{x}^* \qquad (3.21)$$

which is in terms of the error-covariance matrix, rather than its trace. The resulting minimum
value is

$$J(K_o) = R_x - R_{xy}K_o^* \qquad (3.22)$$

The optimization problem (3.21) is interesting for two reasons. First, the cost function
$J(K) = E\,\tilde{x}\tilde{x}^*$ is matrix-valued. That is, it assumes matrix values for each choice of K.
Second, the unknown argument, K, is a matrix itself. In this way, problem (3.21) involves
minimizing a matrix-valued cost function over a matrix-valued argument.


3.5 OPTIMAL LINEAR ESTIMATOR 


We shall have more to say about the solution(s) $K_o$ of the normal equations $K_o R_y = R_{xy}$.
For now, we summarize the main conclusions of the last two sections.


Theorem 3.1 (Optimal linear estimator) Given zero-mean random variables x
and y, the linear least-mean-squares estimator (l.l.m.s.e.) of x given y is

$$\hat{x} = K_o y$$

where $K_o$ is any solution to the linear system of equations $K_o R_y = R_{xy}$.
This estimator minimizes the following two error measures:

$$\min_K \; E\,\tilde{x}^*\tilde{x} \quad \text{and} \quad \min_K \; E\,\tilde{x}\tilde{x}^*$$

The scalar cost on the left is the trace of the matrix cost on the right. The
resulting minimum mean-square errors, as defined by (3.4) and (3.22), are
given by

$$\min_K \; E\,\tilde{x}^*\tilde{x} = \mathrm{Tr}(R_x - K_o R_y K_o^*)$$

$$\min_K \; E\,\tilde{x}\tilde{x}^* = R_x - K_o R_y K_o^*$$


CHAPTER 4


Orthogonality Principle 


Before examining the properties of the solution(s) $K_o$ of the normal equations $K_o R_y = R_{xy}$,
we consider some illustrative examples in the context of symbol estimation and channel
equalization. Additional examples and applications are discussed in Chapter 5.


4.1 DESIGN EXAMPLES 


Example 4.1 (Noisy measurements of a binary signal) 


We reconsider Ex. 1.2 of a BPSK signal x that assumes the values ±1 with probability 1/2. The
measurement y is y = x + v, where x and the disturbance v are independent of each other, with v
being zero-mean Gaussian of unit variance.


Both x and y have zero means so that, according to Thm. 3.1, the optimal linear estimator of x is
$\hat{x} = k_o y$, where the (now scalar) coefficient $k_o$ is obtained from solving $k_o \sigma_y^2 = \sigma_{xy}$. We therefore
need to determine the quantities $\{\sigma_y^2, \sigma_{xy}\}$. Now since {x, v} are independent we have

$$\sigma_y^2 = \sigma_x^2 + \sigma_v^2 = 1 + 1 = 2$$

Moreover,

$$\sigma_{xy} = E\,xy = E\,x(x + v) = E\,x^2 + 0 = 1$$

so that $k_o = 1/2$, and the optimal linear estimator is $\hat{x} = y/2$. That is, we simply scale the received
signal by 1/2. In contrast, the optimal estimator was found in Ex. 1.2 to be given by the nonlinear
transformation tanh(y). In addition, observe that the form of the linear estimator, $\hat{x} = y/2$, is valid
regardless of whether the noise v is Gaussian or not (i.e., the Gaussian assumption on v is not needed
to arrive at $k_o = 1/2$). The form of the optimal estimator, $\hat{x} = \tanh(y)$, on the other hand, is very
much tied to the Gaussian assumption on v.
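The remark that $k_o = 1/2$ does not depend on the noise being Gaussian can be illustrated with a small simulation. The Python sketch below estimates $k_o = \sigma_{xy}/\sigma_y^2$ from sample moments under two unit-variance noise models; the uniform model, the function name `estimate_linear_gain`, and the number of trials are assumptions made only for this illustration.

```python
import math
import random

def estimate_linear_gain(noise, trials=50000, seed=0):
    """Estimate k_o = sigma_xy / sigma_y^2 from sample moments of (x, y = x + v)."""
    rng = random.Random(seed)
    s_xy = s_yy = 0.0
    for _ in range(trials):
        x = 1.0 if rng.random() < 0.5 else -1.0   # BPSK symbol with p = 1/2
        y = x + noise(rng)
        s_xy += x * y
        s_yy += y * y
    return s_xy / s_yy

def gaussian(rng):
    """Zero-mean, unit-variance Gaussian noise."""
    return rng.gauss(0.0, 1.0)

HALF_WIDTH = math.sqrt(3.0)   # a uniform variable on [-sqrt(3), sqrt(3)] has unit variance

def uniform(rng):
    """Zero-mean, unit-variance uniform noise."""
    return rng.uniform(-HALF_WIDTH, HALF_WIDTH)

print(estimate_linear_gain(gaussian), estimate_linear_gain(uniform))
```

Both estimates should land close to 1/2, since only the mean and variance of v enter the calculation of $k_o$.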


Let us now reconsider Ex. 2.1, where we collect two noisy measurements y(0) and y(1) of x,
say,

$$y(0) = x + v(0) \quad \text{and} \quad y(1) = x + v(1)$$

where {v(0), v(1)} are zero-mean unit-variance Gaussian random variables that are independent
of each other and of x. The value of x is the same in both measurements (i.e., if it is +1 in the
measurement y(0), it is also +1 in the measurement y(1), and similarly for −1); recall Fig. 2.1.
Introduce the column vector y = col{y(0), y(1)}. Then, according to Thm. 3.1, the optimal linear
estimator of x given y is $\hat{x} = k_o^* y$, where $k_o^*$ is now 1 × 2 and is obtained from the solution
of the normal equations $k_o^* R_y = R_{xy}$. To determine $\{R_y, R_{xy}\}$ we proceed as follows. Since
{x, v(0), v(1)} are independent we get

$$R_y = E\,yy^* = \begin{bmatrix} E\,|y(0)|^2 & E\,y(0)y^*(1) \\ E\,y(1)y^*(0) & E\,|y(1)|^2 \end{bmatrix} = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}$$

where we used the fact that

$$E\,y(0)y^*(1) = E\,(x + v(0))(x + v(1))^* = E\,|x|^2 = 1$$

Likewise,

$$R_{xy} = E\,xy^* = \begin{bmatrix} E\,xy^*(0) & E\,xy^*(1) \end{bmatrix} = \begin{bmatrix} 1 & 1 \end{bmatrix}$$

so that

$$k_o^* = R_{xy}R_y^{-1} = \begin{bmatrix} 1 & 1 \end{bmatrix} \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}^{-1} = \begin{bmatrix} 1/3 & 1/3 \end{bmatrix}$$

That is,

$$\hat{x} = \frac{1}{3}\,[y(0) + y(1)]$$

Again, this expression for the linear least-mean-squares estimator of x given {y(0), y(1)} holds
regardless of whether the noises {v(0), v(1)} are Gaussian or not. Only the first and second moments
of {v(0), v(1)}, namely, their means and variances, are needed to determine $k_o^*$. In the context of
the two-antenna example of Fig. 2.1, the above result leads to the optimal linear receiver structure
shown in Fig. 4.1.


FIGURE 4.1 An optimal linear receiver for recovering a BPSK transmission from two
measurements in the presence of additive unit-variance uncorrelated noises. (Block diagram: the
measurements y(0) and y(1) are added and scaled by 1/3 to produce $\hat{x}$.)


Example 4.2 (Multiple measurements of a binary signal) 


Continuing with Ex. 4.1, let us examine what happens if we increase the number of available
measurements from 2 to N, say,

$$y(i) = x + v(i), \qquad i = 0, 1, \ldots, N-1$$

Introduce the observation vector y = col{y(0), y(1), ..., y(N − 1)}. Then, say, for N = 5,

$$R_{xy} = \begin{bmatrix} 1 & 1 & 1 & 1 & 1 \end{bmatrix}, \qquad R_y = \begin{bmatrix} 2 & 1 & 1 & 1 & 1 \\ 1 & 2 & 1 & 1 & 1 \\ 1 & 1 & 2 & 1 & 1 \\ 1 & 1 & 1 & 2 & 1 \\ 1 & 1 & 1 & 1 & 2 \end{bmatrix}$$

so that

$$\hat{x} = R_{xy}R_y^{-1}y = \begin{bmatrix} 1 & 1 & 1 & 1 & 1 \end{bmatrix} \begin{bmatrix} 2 & 1 & 1 & 1 & 1 \\ 1 & 2 & 1 & 1 & 1 \\ 1 & 1 & 2 & 1 & 1 \\ 1 & 1 & 1 & 2 & 1 \\ 1 & 1 & 1 & 1 & 2 \end{bmatrix}^{-1} y$$

We need to evaluate $R_y^{-1}$. Due to the special structure of $R_y$, its inverse can be evaluated in closed
form for any N, as we explain below. Later, in Sec. 5.5, when we reconsider this problem, we shall
show how to evaluate $\hat{x}$ via a more direct route.

Observe that, for any N, the matrix $R_y$ can be expressed as $R_y = I + aa^T$, where a is the N × 1
column vector a = col{1, 1, 1, ..., 1}. In other words, $R_y$ is a rank-one modification of the identity
matrix. This is a useful observation since the inverse of every such matrix has a similar form (see
Prob. II.1). Specifically,

$$(I + aa^T)^{-1} = I - \frac{aa^T}{1 + \|a\|^2} = I - \frac{1}{N+1}\,aa^T$$

where $\|a\|^2$ denotes the squared Euclidean norm of a, $\|a\|^2 = a^T a = N$. Using this result we find that

$$R_{xy}R_y^{-1} = a^T\left(I - \frac{1}{N+1}\,aa^T\right) = \frac{1}{N+1}\,a^T$$

so that

$$\hat{x} = \frac{1}{N+1}\sum_{i=0}^{N-1} y(i)$$
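The closed-form inverse of the rank-one modification $I + aa^T$, and the resulting equal weights of 1/(N + 1) on all measurements, can be confirmed with exact arithmetic. The following Python sketch is only an illustrative check; the helper names `matmul` and `check_rank_one_inverse` are hypothetical.

```python
from fractions import Fraction

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def check_rank_one_inverse(N):
    """Verify (I + a a^T)^{-1} = I - a a^T / (N + 1) and return the row R_xy R_y^{-1}."""
    one, zero = Fraction(1), Fraction(0)
    identity = [[one if i == j else zero for j in range(N)] for i in range(N)]
    R_y = [[identity[i][j] + 1 for j in range(N)] for i in range(N)]          # I + a a^T
    R_y_inv = [[identity[i][j] - Fraction(1, N + 1) for j in range(N)] for i in range(N)]
    assert matmul(R_y, R_y_inv) == identity
    R_xy = [[one] * N]                                                         # a^T
    return matmul(R_xy, R_y_inv)[0]

for N in (2, 5, 8):
    print(N, check_rank_one_inverse(N))
```

For N = 2 the weights reduce to 1/3 each, in agreement with the two-measurement case treated in Ex. 4.1.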


Example 4.3 (Transmissions over a noisy channel) 


Consider again the setting of Exs. A.4 and 2.1, where independent and identically distributed (i.i.d.)
symbols {s(i)} are transmitted over the FIR channel $C(z) = 1 + 0.5z^{-1}$. Each symbol is either
+1 with probability p or −1 with probability 1 − p, and the output of the channel is corrupted
by zero-mean additive white Gaussian noise v(i) of unit variance. The noise and the symbols are
independent of each other (see Fig. 4.2). We also assume, for simplicity, that p = 1/2. We want to
estimate the vector x = col{s(0), s(1)} from the observation vector y = col{y(0), y(1)}, where

$$y(0) = s(0) + v(0), \qquad y(1) = s(1) + 0.5s(0) + v(1) \qquad (4.1)$$

We are assuming that transmissions start at time 0 so that s(−1) = 0. According to Thm. 3.1, the
optimal linear estimator of x is $\hat{x} = K_o y$, where $K_o$ is 2 × 2 and is obtained by solving the normal
equations $K_o R_y = R_{xy}$. We therefore need to determine $\{R_y, R_{xy}\}$.


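FIGURE 4.2 Data transmissions through an additive Gaussian-noise channel.

It follows from the relations (4.1) that

$$R_{xy} = E \begin{bmatrix} s(0) \\ s(1) \end{bmatrix} \begin{bmatrix} y^*(0) & y^*(1) \end{bmatrix} = \begin{bmatrix} 1 & 1/2 \\ 0 & 1 \end{bmatrix}$$

Moreover,

$$R_y = E \begin{bmatrix} y(0) \\ y(1) \end{bmatrix} \begin{bmatrix} y^*(0) & y^*(1) \end{bmatrix} = \begin{bmatrix} 2 & 1/2 \\ 1/2 & 9/4 \end{bmatrix}$$

so that

$$K_o = R_{xy}R_y^{-1} = \begin{bmatrix} 1 & 1/2 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 2 & 1/2 \\ 1/2 & 9/4 \end{bmatrix}^{-1} = \frac{4}{17} \begin{bmatrix} 2 & 1/2 \\ -1/2 & 2 \end{bmatrix}$$

That is, the estimates of s(0) and s(1) are

$$\hat{x}(0) = 8y(0)/17 + 2y(1)/17 \quad \text{and} \quad \hat{x}(1) = -2y(0)/17 + 8y(1)/17$$

This example is pursued further below and in Probs. II.9 and II.10. Later in Sec. 5.3 we study the
general case of estimating a block of data {s(·)}, of generic length N, from a block of observations
{y(·)}.

◇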
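The numbers in this example can be reproduced with a few lines of exact arithmetic. The Python sketch below builds $R_{xy}$ and $R_y$ from the channel taps and the unit variances of the symbols and the noise, and then forms $K_o = R_{xy}R_y^{-1}$; the variable and function names are hypothetical.

```python
from fractions import Fraction as F

def inverse_2x2(M):
    """Inverse of a 2 x 2 matrix with nonzero determinant."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

c0, c1 = F(1), F(1, 2)        # taps of C(z) = 1 + 0.5 z^{-1}
var_s, var_v = F(1), F(1)     # symbol and noise variances

# y(0) = c0 s(0) + v(0) and y(1) = c0 s(1) + c1 s(0) + v(1), with s(-1) = 0.
R_xy = [[var_s * c0, var_s * c1],
        [F(0), var_s * c0]]
R_y = [[var_s * c0 ** 2 + var_v, var_s * c0 * c1],
       [var_s * c0 * c1, var_s * (c0 ** 2 + c1 ** 2) + var_v]]

K_o = matmul(R_xy, inverse_2x2(R_y))
print(K_o)
```

The printed rows correspond to the weights 8/17, 2/17 and −2/17, 8/17 that multiply y(0) and y(1) in the two estimates.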


Example 4.4 (Linear channel equalization) 


Consider again the setting of Ex. 4.3, and assume now that the transmissions started in the remote
past (rather than at time 0), i.e., for i > −∞, so that all random processes {s(i), v(i), y(i)} can be
assumed to be wide-sense stationary; the need for this assumption will become evident soon. By
a wide-sense stationary process s(·), we mean one with a constant mean and whose auto-correlation
sequence is only a function of the time lag, i.e., $r_s(k) = E\,s(i)s^*(i-k)$ is only a function of k.


Now referring to Fig. 4.3, we see that the output of the channel at any time instant i is a linear
combination of the current symbol s(i) and the previous symbol s(i − 1), i.e.,

$$z(i) = s(i) + 0.5s(i-1)$$

We therefore say that the channel introduces inter-symbol interference, or ISI, since a symbol
transmitted at a prior time, s(i − 1), interferes with the output at the time of the current symbol, s(i). The
measurement y(i) that is available at the receiver is a noisy version of z(i), namely,

$$y(i) = s(i) + 0.5s(i-1) + v(i) \qquad (4.2)$$


The purpose of this example is to show how to design an equalizer for the channel. The function of the
equalizer is to process the received signal {y(i)} in order to recover the transmitted symbol {s(i)},
or a delayed version of it, say, {s(i − Δ)} for some Δ. There are many different structures that can
be used for equalization purposes. In Fig. 4.3 we show an FIR equalizer structure that consists of
three taps {α(0), α(1), α(2)}. Its output at any time instant i is given by the linear combination

$$\alpha(0)y(i) + \alpha(1)y(i-1) + \alpha(2)y(i-2)$$

We wish to determine the taps {α(0), α(1), α(2)} so that the output of the equalizer is the optimal
linear estimator for s(i) (we choose Δ = 0 in this example; see Prob. II.17 for nonzero values of
Δ). This procedure is known as linear minimum mean-square-error equalization in communications
applications. We discuss it in greater detail in Sec. 5.4 for higher-order channels and equalizers, and
also for nonzero values of Δ. The discussion here is meant to illustrate some of the key concepts.


Observe from Fig. 4.3 that at any time instant i, the equalizer uses three observations {y(i), y(i −
1), y(i − 2)} in order to estimate s(i). Therefore, the observation vector is y = col{y(i), y(i −
1), y(i − 2)} and the variable we wish to estimate is x = s(i). We then know from Thm. 3.1 that


FIGURE 4.3 Linear channel equalization.


the optimal linear estimator for x is given by

$$\hat{x} = \hat{s}(i) = k_o^* y$$

where the row vector $k_o^*$ is found from solving the normal equations $k_o^* R_y = R_{xy}$. Once $k_o^*$ is
found, its entries give the desired tap coefficients {α(0), α(1), α(2)}, i.e.,

$$k_o^* = \begin{bmatrix} \alpha(0) & \alpha(1) & \alpha(2) \end{bmatrix}$$

In order to find $k_o^*$, we need to determine $\{R_{xy}, R_y\}$. Thus, let $r_y(k)$ denote the auto-correlation
sequence of the stationary process {y(i)}, i.e., $r_y(k) = E\,y(i)y^*(i-k)$. Then

$$R_y = \begin{bmatrix} r_y(0) & r_y(1) & r_y(2) \\ r_y^*(1) & r_y(0) & r_y(1) \\ r_y^*(2) & r_y^*(1) & r_y(0) \end{bmatrix}$$

To determine $\{r_y(0), r_y(1), r_y(2)\}$, we use the output equation (4.2). Multiplying it from the right
by $y^*(i)$ we get

$$y(i)y^*(i) = s(i)y^*(i) + 0.5s(i-1)y^*(i) + v(i)y^*(i)$$

Taking expectations of both sides, and recalling that the variables {s(i), s(i − 1), v(i)} are independent
of each other, we find that

$$\begin{aligned}
E\,y(i)y^*(i) &= r_y(0) \\
E\,s(i)y^*(i) &= E\,s(i)[s(i) + 0.5s(i-1) + v(i)]^* = 1 \\
E\,s(i-1)y^*(i) &= E\,s(i-1)[s(i) + 0.5s(i-1) + v(i)]^* = 1/2 \\
E\,v(i)y^*(i) &= E\,v(i)[s(i) + 0.5s(i-1) + v(i)]^* = 1
\end{aligned}$$

so that

$$r_y(0) = 1 + 1/4 + 1 = 9/4$$

Likewise, multiplying (4.2) from the right by $y^*(i-1)$ and taking expectations of both sides, we get
$r_y(1) = 1/2$. Finally, multiplying (4.2) from the right by $y^*(i-2)$ and taking expectations we get
$r_y(2) = 0$. In summary, we find that


$$R_y = \begin{bmatrix} 9/4 & 1/2 & 0 \\ 1/2 & 9/4 & 1/2 \\ 0 & 1/2 & 9/4 \end{bmatrix}$$

In a similar fashion, we can evaluate $R_{xy}$,

$$R_{xy} = E\,xy^* = \begin{bmatrix} E\,s(i)y^*(i) & E\,s(i)y^*(i-1) & E\,s(i)y^*(i-2) \end{bmatrix}$$

Thus, multiplying (4.2) from the right by $s^*(i)$ and taking expectations we get $E\,y(i)s^*(i) = 1$.
Likewise, multiplying (4.2) from the right by $s^*(i+1)$ and taking expectations we get
$E\,y(i)s^*(i+1) = 0$. Similarly, $E\,y(i)s^*(i+2) = 0$. It further follows from the assumed stationarity of the
processes {s(i), v(i)} that

$$E\,y(i)s^*(i+1) = E\,y(i-1)s^*(i) = [E\,s(i)y^*(i-1)]^*$$

and

$$E\,y(i)s^*(i+2) = E\,y(i-2)s^*(i) = [E\,s(i)y^*(i-2)]^*$$

Hence,

$$R_{xy} = \begin{bmatrix} 1 & 0 & 0 \end{bmatrix}$$

Later, in Sec. 5.4, we describe a more general procedure for determining the quantities $\{R_y, R_{xy}\}$
for arbitrary channel and equalizer lengths.
Using the just derived values for $\{R_{xy}, R_y\}$, we are led to

$$k_o^* = R_{xy}R_y^{-1} = \begin{bmatrix} 1 & 0 & 0 \end{bmatrix} \begin{bmatrix} 9/4 & 1/2 & 0 \\ 1/2 & 9/4 & 1/2 \\ 0 & 1/2 & 9/4 \end{bmatrix}^{-1} = \begin{bmatrix} 0.4688 & -0.1096 & 0.0244 \end{bmatrix}$$

That is, α(0) = 0.4688, α(1) = −0.1096, and α(2) = 0.0244. Moreover, the resulting m.m.s.e. is
m.m.s.e. = $\sigma_x^2 - R_{xy}k_o = 0.5312$, where $\sigma_x^2 = \sigma_s^2 = 1$. A computer project at the end of this part
illustrates the operation of optimal linear equalizers designed in this manner.

◇
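The tap values quoted above can be reproduced by building $R_y$ and $R_{xy}$ from the channel taps and solving the 3 × 3 normal equations exactly. The Python sketch below assumes real-valued data, unit symbol and noise variances, and zero equalization delay; the names `solve`, `r_y`, and `taps` are hypothetical.

```python
from fractions import Fraction as F

def solve(A, b):
    """Solve A u = b exactly by Gauss-Jordan elimination (A assumed invertible)."""
    n = len(A)
    aug = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n] for row in aug]

channel = [F(1), F(1, 2)]              # taps of C(z) = 1 + 0.5 z^{-1}
var_s, var_v, n_taps = F(1), F(1), 3   # unit variances, three-tap equalizer

def r_y(k):
    """Auto-correlation r_y(k) = E y(i) y(i - k) for lag k >= 0 (real data)."""
    overlap = sum(channel[m] * channel[m + k] for m in range(len(channel) - k))
    return var_s * overlap + (var_v if k == 0 else 0)

R_y = [[r_y(abs(i - j)) for j in range(n_taps)] for i in range(n_taps)]
R_xy = [var_s * channel[0]] + [F(0)] * (n_taps - 1)   # E s(i) y(i - j), j = 0, 1, 2

# R_y is real and symmetric, so k_o^* R_y = R_xy is the same system as R_y k_o = R_xy^T.
taps = solve(R_y, R_xy)
mmse = var_s - sum(r * k for r, k in zip(R_xy, taps))
print([round(float(t), 4) for t in taps], round(float(mmse), 4))
```

Changing `channel` or `n_taps` gives the corresponding design for other channel and equalizer lengths, as long as the delay remains zero.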


4.2 ORTHOGONALITY CONDITION 


The linear least-mean-squares estimator admits an important geometric interpretation in
the form of an orthogonality condition. This can be seen by rewriting the normal equations
(3.9) as $K_o\,E\,yy^* = E\,xy^*$ or, equivalently,

$$E\,(x - K_o y)y^* = 0 \qquad (4.3)$$

The difference $x - K_o y$ is the estimation error, $\tilde{x}$. Therefore, equality (4.3) states that the
error is orthogonal to (or uncorrelated with) the observation vector y, namely,

$$E\,\tilde{x}y^* = 0$$

which we also write as (see Fig. 4.4):

$$\tilde{x} \perp y \qquad (4.4)$$

We thus conclude that for linear least-mean-squares estimation, the estimation error is
orthogonal to the data and, in fact, to any linear transformation of the data, say, Ay for any
matrix A. This fact means that no further linear transformation of y can extract additional
information about x in order to further reduce the error covariance matrix. Moreover, since
the estimator $\hat{x}$ is itself a linear function of y, we obtain, as a special case, that

$$\tilde{x} \perp \hat{x} \qquad (4.5)$$

FIGURE 4.4 The orthogonality condition for linear estimation: $\tilde{x} \perp y$.

That is, the estimation error is also orthogonal to the estimator. Actually, the orthogonality 
condition is the defining property of optimality in the linear least-mean-squares sense, as 
the following result shows. 


Theorem 4.1 (Orthogonality principle) Given two zero-mean random variables
x and y, a linear estimator $\hat{x} = K_o y$ is optimal in the least-mean-squares
sense (3.3) if, and only if, it satisfies $x - \hat{x} \perp y$, i.e., $E\,(x - \hat{x})y^* = 0$.


Proof: One direction was argued prior to the statement of the theorem. Specifically, if $\hat{x}$ is the
optimal linear estimator, then we know from (4.4) that $\tilde{x} \perp y$. Conversely, assume $\hat{x}$ is a linear
estimator for x that satisfies $x - \hat{x} \perp y$, and let $\hat{x} = Ky$ for some K. It then follows from
$x - \hat{x} \perp y$ that $E\,(x - Ky)y^* = 0$, so that K satisfies the normal equations $KR_y = R_{xy}$ and,
hence, from Thm. 3.1 we conclude that $\hat{x}$ should be the optimal linear estimator. ◇
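The orthogonality condition is easy to test from second-order moments alone, since $E\,(x - Ky)y^* = R_{xy} - KR_y$. The short Python sketch below evaluates this quantity for a gain that solves the normal equations and for an arbitrary alternative gain; the moments are those of a two-symbol, two-measurement channel model with unit variances, used here purely as sample data, and `K_other` is a hypothetical choice.

```python
from fractions import Fraction as F

def error_cross_covariance(K, R_xy, R_y):
    """Return E (x - K y) y^* = R_xy - K R_y for real-valued moments."""
    q = len(R_y)
    return [[R_xy[i][j] - sum(K[i][m] * R_y[m][j] for m in range(q)) for j in range(q)]
            for i in range(len(K))]

R_xy = [[F(1), F(1, 2)], [F(0), F(1)]]
R_y = [[F(2), F(1, 2)], [F(1, 2), F(9, 4)]]

K_o = [[F(8, 17), F(2, 17)], [F(-2, 17), F(8, 17)]]   # solves K R_y = R_xy
K_other = [[F(1, 2), F(0)], [F(0), F(1, 2)]]          # an arbitrary alternative gain

print(error_cross_covariance(K_o, R_xy, R_y))         # every entry is zero
print(error_cross_covariance(K_other, R_xy, R_y))     # nonzero entries remain
```

Only the gain that satisfies the normal equations leaves the error uncorrelated with the data, which is the content of Thm. 4.1.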


Remark 4.1 (Terminology) The designation normal equations for (3.9) is motivated by the fact
that these equations arise from the orthogonality condition (4.3). In the adaptive filtering literature,
an optimal solution $K_o$ of the normal equations is often mistakenly called the “Wiener solution.” As
explained in App. 3.C of Sayed (2003), Wiener solved a more elaborate problem.

◇
Example 4.5 (Signal with exponential auto-correlation) 
Consider a scalar zero-mean stationary random process {z(t)} with auto-correlation function: 
R.(r) Ê Ez(t)z' (t - 7) = e7% (4.6) 


That is, the samples of z(t) become less correlated as the time gap between them increases. A 
so-called random telegraph signal has this property — see Prob. II.23. It is claimed that the linear 
least-mean squares estimator of z(73) given z(71) and z(T2) (assuming T1 < T» < T3) is 


3(T3) = e 2057 T9) (72) 


That is, the estimator of a future value depends only on the most recent observation, z(T2). We can 
verify the validity of this claim by checking whether the above estimator satisfies the orthogonality 
condition. 
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So define the observation vector y = col{z(T1), z(T2)} and let © = z(T3). We can now 
evaluate the cross-correlation vector 


E(x- êy" = E [z(Ts) - e 075-7? (72) ]y* 


If the answer is zero then the orthogonality condition is satisfied and the estimator is optimal in the 
linear least-mean-squares sense, as claimed. Otherwise, the estimator is not optimal. Using the given 
auto-correlation function (4.6), it is easy to verify that 


E [«(19) ER gem y [ e 205-71) — e~%(T3-T1) e^ * (05-72) _ e^ 9 (75-12) 


[o 0] 


so that the estimator is optimal. 
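The orthogonality check of this example can also be carried out numerically. The following is a minimal sketch, assuming the auto-correlation $e^{-\alpha|\tau|}$ of (4.6); the function names and the sample values of $T_1$, $T_2$, $T_3$, and $\alpha$ are hypothetical and chosen only for illustration.

```python
import math

def autocorr(tau, alpha):
    """R_z(tau) = exp(-alpha * |tau|), the auto-correlation assumed in (4.6)."""
    return math.exp(-alpha * abs(tau))

def orthogonality_residual(t1, t2, t3, alpha):
    """Return E[(z(T3) - g z(T2)) z*(Tk)] for k = 1, 2, with g = R_z(T3 - T2)."""
    g = autocorr(t3 - t2, alpha)  # gain of the claimed estimator
    r1 = autocorr(t3 - t1, alpha) - g * autocorr(t2 - t1, alpha)
    r2 = autocorr(t3 - t2, alpha) - g * autocorr(0.0, alpha)
    return r1, r2

# Hypothetical sample instants and decay rate (illustration only).
residual = orthogonality_residual(t1=0.3, t2=1.1, t3=2.0, alpha=0.7)
print(residual)
```

Both entries vanish up to rounding because $e^{-\alpha(T_3-T_2)}e^{-\alpha(T_2-T_1)} = e^{-\alpha(T_3-T_1)}$, which is the step that makes the first entry of the cross-correlation vector zero.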


◇
4.3 EXISTENCE OF SOLUTIONS 
Consider again the normal equations (3.9), namely, 
$$K_oR_y = R_{xy} \qquad (4.7)$$


In general, such linear systems of equations can have a unique solution, no solution at all, 
or an infinite number of solutions. This depends on the rank of the coefficient matrix $R_y$
and on how the right-hand side matrix $R_{xy}$ relates to $R_y$. We shall explain these facts here
and, in the process, provide the reader with an opportunity to get acquainted with some 
basic concepts from matrix theory and linear algebra. The reader may wish to review the 
material in Sec. B.2 for an overview of the concepts of nullspace, range space, and rank of 
a matrix. 


Unique solution. The argument will show that a unique solution $K_o$ exists if, and only
if, the covariance matrix $R_y$ is positive-definite. Indeed, if $R_y > 0$ then all its eigenvalues
are positive and, consequently, $R_y$ is nonsingular. In this case, equation (4.7) will have a
unique solution $K_o$ given by $K_o = R_{xy}R_y^{-1}$. Conversely, assume a unique solution $K_o$
exists to the normal equations (4.7) and let us establish that it must hold that $R_y > 0$.
Assume, to the contrary, that $R_y$ is singular. Then there should exist a nonzero row vector
$c^*$ such that $c^*R_y = 0$ or, equivalently, $R_yc = 0$. The vector $c$ belongs to the nullspace of
$R_y$, written as $c \in \mathcal{N}(R_y)$. It is now easy to see that by adding to the rows of $K_o$ any such
vector $c^*$, we obtain a matrix $K_o'$ that satisfies the same equations (4.7), i.e., $K_o'R_y = R_{xy}$.
This contradicts the fact that $K_o$ is unique so that $R_y$ must be positive-definite.


Infinitely many solutions. We now show that the normal equations (4.7) will have
infinitely many solutions $K_o$ if, and only if, $R_y$ is singular. One direction of the proof is
obvious. Assume $K_o$ is one solution and that $R_y$ is singular. Then by adding to the rows
of $K_o$ any vector $c^*$ (or any combination of vectors) from the nullspace of $R_y$, we obtain
another solution $K_o'$ (as explained above). Hence, infinitely many solutions exist in this
case. Conversely, assume that many solutions exist. Let $K_o$ and $K_o'$ denote any two of
these solutions. Then subtracting the equations $K_oR_y = R_{xy}$ and $K_o'R_y = R_{xy}$ we obtain
$(K_o - K_o')R_y = 0$, which means that $R_y$ is singular, since there is at least one nonzero
row in $K_o - K_o'$ and this row annihilates $R_y$.


Existence of solutions. The only question that remains regarding $K_oR_y = R_{xy}$ is
whether solutions always exist. The answer is affirmative. That is, the normal equations
(4.7) are always consistent. Recall that a linear system of equations $Ax = b$ is said to
be consistent if the vector $b$ lies in the range space of $A$, written as $b \in \mathcal{R}(A)$. This is
equivalent to saying that a solution vector $x$ exists. To establish this fact for the normal
equations $K_oR_y = R_{xy}$, we need to show that for any two random variables $\{\mathbf{x}, \mathbf{y}\}$, it
always holds that the columns of $R_{yx}$ lie in the column span of $R_y$, i.e., that there exists
at least one matrix $K_o^*$ that satisfies $R_yK_o^* = R_{yx}$, which by transposition is equivalent
to $K_oR_y = R_{xy}$. This statement is obviously true when $R_y$ is nonsingular since then
$K_o = R_{xy}R_y^{-1}$. The argument is more involved when $R_y$ is singular. We first establish
a preliminary result that provides an equivalent characterization for checking whether the
normal equations $K_oR_y = R_{xy}$ are consistent or not.


Lemma 4.1 (Consistent equations) The equations $K_oR_y = R_{xy}$ are consistent
(i.e., there exists at least one solution $K_o$) if, and only if,

$$c^*R_{yx} = 0 \quad \text{for any } c \in \mathcal{N}(R_y)$$

that is, $c^*R_{yx} = 0$ for any column vector $c$ satisfying $R_yc = 0$.


Proof: We first verify that the condition $c^*R_{yx} = 0$ for all $c \in \mathcal{N}(R_y)$ implies the existence of a
matrix $K_o$ satisfying $K_oR_y = R_{xy}$. Let us assume, to the contrary, that the equations $K_oR_y = R_{xy}$
are not consistent. This means that $R_{yx}$ does not lie in the column span of $R_y$, which in turn means
that there should exist a vector $c \in \mathcal{N}(R_y)$ that is not orthogonal to $R_{yx}$. This conclusion contradicts
the assumption that $c^*R_{yx} = 0$ for all $c \in \mathcal{N}(R_y)$.

We now establish the converse statement, namely, that the existence of a $K_o$ satisfying $K_oR_y = R_{xy}$
implies $c^*R_{yx} = 0$ for all $c \in \mathcal{N}(R_y)$. This claim is obvious. For any such $c$, we have $R_yc = 0$
and, hence, $K_oR_yc = 0$. This implies that $R_{xy}c = 0$ or, equivalently, $c^*R_{yx} = 0$. ◇


Therefore, in order to establish that the equations $K_oR_y = R_{xy}$ are consistent, it is
enough to prove that $c^*R_{yx} = 0$ for all $c \in \mathcal{N}(R_y)$. So let us assume that this latter
condition does not hold. This means that there exists at least one nonzero vector $c$ such
that $c^*R_y = 0$ and $c^*R_{yx} \neq 0$. This statement leads to a contradiction. Indeed, $c^*R_y = 0$
implies that $c^*R_yc = c^*(E\,\mathbf{y}\mathbf{y}^*)c = 0$ or, equivalently, $E|c^*\mathbf{y}|^2 = 0$. We therefore have a
zero-mean random variable $c^*\mathbf{y}$ (because $\mathbf{y}$ is zero-mean itself) with zero variance. Using
Remark A.1, we conclude that $c^*\mathbf{y}$ is the zero random variable with probability one. It
follows that

$$c^*R_{yx} \triangleq c^*(E\,\mathbf{y}\mathbf{x}^*) = E(c^*\mathbf{y})\mathbf{x}^* = 0$$

This contradicts the assumption $c^*R_{yx} \neq 0$ and we conclude that the normal equations
(4.7) are always consistent.


Uniqueness of estimator. Interestingly enough, regardless of which solution $K_o$ we pick
(in the case when a multitude of solutions exist), the resulting estimator, $\hat{\mathbf{x}} = K_o\mathbf{y}$, and the
resulting m.m.s.e., $R_x - K_oR_yK_o^*$, will always assume the same values. In other words,
their values are independent of the specific choice of $K_o$.

To see this, let us first establish the result for the m.m.s.e. Thus, let $K_o$ and $K_o'$ be two
possible solutions of (4.7), i.e., $K_oR_y = R_{xy}$ and $K_o'R_y = R_{xy}$. Then the difference
$K_o - K_o'$ satisfies

$$[K_o - K_o']R_y = 0 \qquad (4.8)$$

Now denote the minimum mean-square errors by

$$\Delta_1 = R_x - K_oR_yK_o^* = R_x - R_{xy}K_o^* \quad \text{and} \quad \Delta_2 = R_x - K_o'R_yK_o'^* = R_x - R_{xy}K_o'^*$$

Subtracting the expressions for $\{\Delta_1, \Delta_2\}$ we obtain

$$\Delta_2 - \Delta_1 = R_{xy}[K_o - K_o']^* = K_oR_y[K_o - K_o']^* = 0$$

where in the second equality we substituted $R_{xy}$ by $K_oR_y$, and in the third equality we
used (4.8). Therefore, $\Delta_1 = \Delta_2$. This means, as desired, that the value of the m.m.s.e. is
independent of $K_o$.

Let us now verify that no matter which $K_o$ we pick, the corresponding estimator $\hat{\mathbf{x}}$ will
be the same. So let again $K_o$ and $K_o'$ be two possible solutions and define $C = K_o' - K_o$.
Then from (4.8) we have $CR_y = 0$. Let further $\hat{\mathbf{x}} = K_o\mathbf{y}$ and $\hat{\mathbf{x}}' = K_o'\mathbf{y}$. We want to
verify that $\hat{\mathbf{x}} = \hat{\mathbf{x}}'$ with probability one. For this purpose, note that the condition $CR_y = 0$
implies $CR_yC^* = 0$ or, equivalently, $E(C\mathbf{y})(\mathbf{y}^*C^*) = 0$. We therefore have a zero-mean
random variable $C\mathbf{y}$ (because $\mathbf{y}$ is zero-mean itself) with a zero covariance matrix. Using
Remark A.1, we again conclude that $C\mathbf{y}$ is the zero random variable with probability one.
It then follows from

$$\hat{\mathbf{x}}' = K_o'\mathbf{y} = (K_o + C)\mathbf{y} = \hat{\mathbf{x}} + C\mathbf{y}$$

that $\hat{\mathbf{x}} = \hat{\mathbf{x}}'$ with probability one, as desired. We summarize our discussions in the following
statement.


Theorem 4.2 (Properties of the linear estimator) Consider the same setting
of Thm. 3.1. Then the normal equations $K_oR_y = R_{xy}$ that define the linear
least-mean-squares estimator have the following properties:

1. They are always consistent, i.e., a solution $K_o$ always exists.
2. The solution $K_o$ is unique if, and only if, $R_y > 0$.
3. Infinitely many solutions $K_o$ exist if, and only if, $R_y$ is singular.

In case 3, regardless of which solution $K_o$ is chosen, the values of the estimator,
$\hat{\mathbf{x}} = K_o\mathbf{y}$, and the m.m.s.e., $(R_x - K_oR_yK_o^*)$, remain invariant.
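The invariance in case 3 can be illustrated with a small numerical sketch. The setup below is an assumption made only for illustration and does not come from the text: $\mathbf{y} = \mathrm{col}\{\mathbf{u}, \mathbf{u}\}$ repeats the same unit-variance variable $\mathbf{u}$, so that $R_y$ is singular, and $\mathbf{x} = \mathbf{u} + \mathbf{w}$ with $\mathbf{w}$ of unit variance and uncorrelated with $\mathbf{u}$.

```python
import numpy as np

# Illustrative covariances (assumed setup): y = col{u, u}, x = u + w.
R_y = np.array([[1.0, 1.0], [1.0, 1.0]])   # singular covariance of y
R_xy = np.array([[1.0, 1.0]])              # cross-covariance E x y^*
R_x = np.array([[2.0]])

# One solution of K R_y = R_xy via the pseudo-inverse ...
K_a = R_xy @ np.linalg.pinv(R_y)
# ... and another obtained by adding a row vector from the nullspace of R_y.
c = np.array([[1.0, -1.0]])                # satisfies c R_y = 0
K_b = K_a + 3.0 * c

for K in (K_a, K_b):
    assert np.allclose(K @ R_y, R_xy)      # both satisfy the normal equations

mmse_a = R_x - K_a @ R_y @ K_a.T
mmse_b = R_x - K_b @ R_y @ K_b.T

# (K_b - K_a) y has zero covariance, so both estimators agree with probability one.
diff_cov = (K_b - K_a) @ R_y @ (K_b - K_a).T
print(mmse_a, mmse_b, diff_cov)
```

The two gain matrices differ, yet the m.m.s.e. values coincide and the difference between the two estimators has zero covariance, in line with the theorem.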


4.4 NONZERO-MEAN VARIABLES 


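Starting from Sec. 3.1, the discussion has focused so far on zero-mean random variables
$\mathbf{x}$ and $\mathbf{y}$. When the means are nonzero (say, $\bar{x} \triangleq E\,\mathbf{x}$ and $\bar{y} \triangleq E\,\mathbf{y}$), we should seek an
unbiased estimator for $\mathbf{x}$ of the form

$$\hat{\mathbf{x}} = K\mathbf{y} + b \qquad (4.9)$$

for some matrix $K$ and some vector $b$. As before, the optimal values for $\{K, b\}$ are determined
through the minimization of the mean-square error,

$$\min_{K,\,b}\; E\,\tilde{\mathbf{x}}^*\tilde{\mathbf{x}} \qquad (4.10)$$

where $\tilde{\mathbf{x}} = \mathbf{x} - \hat{\mathbf{x}}$.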

To solve this problem, we start by noting that since the estimator should be unbiased
we must enforce $E\,\hat{\mathbf{x}} = \bar{x}$. Taking expectations of both sides of (4.9) shows that the vector
$b$ must satisfy $\bar{x} = K\bar{y} + b$. Using this expression for $b$, we can eliminate it from the
expression for $\hat{\mathbf{x}}$, which becomes $\hat{\mathbf{x}} = K\mathbf{y} + (\bar{x} - K\bar{y})$ or, equivalently,

$$(\hat{\mathbf{x}} - \bar{x}) = K(\mathbf{y} - \bar{y}) \qquad (4.11)$$

This expression shows that the desired gain matrix $K$ should map the now zero-mean
variable $(\mathbf{y} - \bar{y})$ to another zero-mean variable $(\hat{\mathbf{x}} - \bar{x})$. In other words, we are reduced
to solving the problem of estimating the zero-mean random variable $\mathbf{x} - \bar{x}$ from the also
zero-mean random variable $\mathbf{y} - \bar{y}$.

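We already know that the solution $K_o$ is found by solving

$$K_oR_y = R_{xy} \qquad (4.12)$$

in terms of the covariance and cross-covariance matrices $\{R_y, R_{xy}\}$ of the zero-mean
variables $\{\mathbf{x} - \bar{x}, \mathbf{y} - \bar{y}\}$, i.e.,

$$R_y \triangleq E(\mathbf{y}-\bar{y})(\mathbf{y}-\bar{y})^*, \qquad R_{xy} \triangleq E(\mathbf{x}-\bar{x})(\mathbf{y}-\bar{y})^*$$

Therefore, the optimal solution in the nonzero-mean case is given by

$$\hat{\mathbf{x}} = \bar{x} + K_o[\mathbf{y} - \bar{y}] \qquad (4.13)$$

with $K_o$ obtained from solving (4.12).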

Comparing (4.13) to the zero-mean case from Thm. 3.1, we see that the solution to
the nonzero-mean case simply amounts to replacing $\mathbf{x}$ and $\mathbf{y}$ by the centered variables
$(\mathbf{x} - \bar{x})$ and $(\mathbf{y} - \bar{y})$, respectively, and then solving a linear estimation problem with these
centered (zero-mean) variables. For this reason, there is no loss of generality, for linear
estimation purposes, to assume that all random variables are zero-mean; the results for the
nonzero-mean case can be deduced via centering.
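The centering recipe of (4.9)-(4.13) can be sketched in a few lines. This is a minimal illustration under stated assumptions: the helper name `affine_lmmse` and all numerical values are hypothetical, the data are real-valued, and $R_y$ is taken to be positive-definite so that (4.12) has a unique solution.

```python
import numpy as np

def affine_lmmse(x_bar, y_bar, R_xy, R_y):
    """Return (K, b) such that x_hat = K y + b, following (4.9)-(4.13).

    R_xy and R_y are the covariances of the centered variables; R_y is
    assumed positive-definite here so that (4.12) has a unique solution.
    """
    K = np.linalg.solve(R_y.T, R_xy.T).T   # solves K R_y = R_xy
    b = x_bar - K @ y_bar                  # enforces E x_hat = x_bar
    return K, b

# Hypothetical numbers chosen only for illustration.
x_bar = np.array([1.0])
y_bar = np.array([2.0, -1.0])
R_y = np.array([[2.0, 0.5], [0.5, 1.0]])
R_xy = np.array([[1.0, 0.2]])

K, b = affine_lmmse(x_bar, y_bar, R_xy, R_y)
y = np.array([2.5, -0.5])                  # one observed realization
x_hat = K @ y + b
print(K, b, x_hat)
```

Writing the estimator as $K\mathbf{y} + b$ or as $\bar{x} + K(\mathbf{y} - \bar{y})$ gives the same value; the second form makes the reduction to the zero-mean problem explicit.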

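It further follows from (4.12) that

$$K_o\,E(\mathbf{y}-\bar{y})(\mathbf{y}-\bar{y})^* = E(\mathbf{x}-\bar{x})(\mathbf{y}-\bar{y})^*$$

or, equivalently,

$$E\left[(\mathbf{x}-\bar{x}) - K_o(\mathbf{y}-\bar{y})\right](\mathbf{y}-\bar{y})^* = 0$$

so that the orthogonality condition (4.4) in the nonzero-mean case becomes

$$\tilde{\mathbf{x}} \perp (\mathbf{y}-\bar{y}) \quad \text{or, equivalently,} \quad \mathbf{x} - \hat{\mathbf{x}} \perp (\mathbf{y}-\bar{y})$$

where $\tilde{\mathbf{x}} = (\mathbf{x} - \bar{x}) - (\hat{\mathbf{x}} - \bar{x}) = \mathbf{x} - \hat{\mathbf{x}}$. Moreover, the resulting m.m.s.e. matrix $E\,\tilde{\mathbf{x}}\tilde{\mathbf{x}}^*$
is equal to

$$\text{m.m.s.e.} = R_x - K_oR_yK_o^*$$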


with $R_x = E(\mathbf{x}-\bar{x})(\mathbf{x}-\bar{x})^*$.

We apply the linear estimation theory of the previous two chapters to the important
special case of linear models, which arises often in applications. Specifically, we now
assume that the zero-mean random vectors $\{\mathbf{x}, \mathbf{y}\}$ are related via a linear model of the
form

$$\mathbf{y} = H\mathbf{x} + \mathbf{v} \qquad (5.1)$$

for some $q \times p$ matrix $H$. Here $\mathbf{v}$ denotes a zero-mean random noise vector with known
covariance matrix, $R_v = E\,\mathbf{v}\mathbf{v}^*$. The covariance matrix of $\mathbf{x}$ is also assumed to be known,
say, $E\,\mathbf{x}\mathbf{x}^* = R_x$. The variables $\{\mathbf{x}, \mathbf{v}\}$ are uncorrelated, i.e., $E\,\mathbf{x}\mathbf{v}^* = 0$, and we further assume
that $R_x > 0$ and $R_v > 0$.


5.1 ESTIMATION USING LINEAR RELATIONS 


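According to Thm. 3.1, when $R_y > 0$, the linear least-mean-squares estimator of $\mathbf{x}$ given
$\mathbf{y}$ is

$$\hat{\mathbf{x}} = R_{xy}R_y^{-1}\mathbf{y} \qquad (5.2)$$

Because of (5.1), the covariances $\{R_{xy}, R_y\}$ can be determined in terms of the given
matrices $\{H, R_x, R_v\}$. Indeed, the uncorrelatedness of $\{\mathbf{x}, \mathbf{v}\}$ gives

$$\begin{aligned} R_y &= E\,\mathbf{y}\mathbf{y}^* = E(H\mathbf{x}+\mathbf{v})(H\mathbf{x}+\mathbf{v})^* = HR_xH^* + R_v \\ R_{xy} &= E\,\mathbf{x}\mathbf{y}^* = E\,\mathbf{x}(H\mathbf{x}+\mathbf{v})^* = R_xH^* \end{aligned}$$

Moreover, since $R_v > 0$ we get $R_y > 0$. The expression (5.2) for $\hat{\mathbf{x}}$ then becomes

$$\hat{\mathbf{x}} = R_xH^*\left[R_v + HR_xH^*\right]^{-1}\mathbf{y} \qquad (5.3)$$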


This expression can be rewritten in an equivalent form by using the so-called matrix inver- 
sion formula or lemma. This formula is a very useful matrix theory result and it will be 
called upon several times throughout this book. The result states that for arbitrary matrices 
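$\{A, B, C, D\}$ of compatible dimensions, if $A$ and $C$ are invertible, then

$$(A + BCD)^{-1} = A^{-1} - A^{-1}B(C^{-1} + DA^{-1}B)^{-1}DA^{-1} \qquad (5.4)$$

The identity can be verified algebraically; it essentially shows how the inverse of the sum
$A + BCD$ is related to the inverse of $A$.
Applying (5.4) to the matrix $[R_v + HR_xH^*]^{-1}$ in (5.3), with the identifications

$$A \leftarrow R_v, \quad B \leftarrow H, \quad C \leftarrow R_x, \quad D \leftarrow H^*$$

we obtain

$$\begin{aligned} \hat{\mathbf{x}} &= R_xH^*\left[R_v^{-1} - R_v^{-1}H(R_x^{-1} + H^*R_v^{-1}H)^{-1}H^*R_v^{-1}\right]\mathbf{y} \\ &= R_x\left[(R_x^{-1} + H^*R_v^{-1}H) - H^*R_v^{-1}H\right](R_x^{-1} + H^*R_v^{-1}H)^{-1}H^*R_v^{-1}\mathbf{y} \\ &= \left[R_x^{-1} + H^*R_v^{-1}H\right]^{-1}H^*R_v^{-1}\mathbf{y} \end{aligned}$$

where in the second equality we factored out the term $(R_x^{-1} + H^*R_v^{-1}H)^{-1}H^*R_v^{-1}\mathbf{y}$ from
the right. Hence,

$$\hat{\mathbf{x}} = \left[R_x^{-1} + H^*R_v^{-1}H\right]^{-1}H^*R_v^{-1}\mathbf{y} \qquad (5.5)$$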


This alternative form can be useful in several contexts. Observe, for example, that when $H$
is a column vector, the quantity $(R_v + HR_xH^*)$ that appears in (5.3) is a matrix, while the
quantity $(R_x^{-1} + H^*R_v^{-1}H)$ that appears in (5.5) is a scalar. In this case, the representation
(5.5) leads to a simpler expression for $\hat{\mathbf{x}}$. In general, the convenience of using (5.3) or (5.5)
depends on the situation at hand. 

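It further follows that the m.m.s.e. matrix is given by

$$\begin{aligned} \text{m.m.s.e.} &= E\,\tilde{\mathbf{x}}\tilde{\mathbf{x}}^* = E(\mathbf{x} - \hat{\mathbf{x}})\mathbf{x}^*, \quad \text{since } \tilde{\mathbf{x}} \perp \hat{\mathbf{x}} \\ &= R_x - \left[R_x^{-1} + H^*R_v^{-1}H\right]^{-1}H^*R_v^{-1}HR_x \\ &= \left[R_x^{-1} + H^*R_v^{-1}H\right]^{-1} \end{aligned}$$

where in the last equality we used the matrix inversion lemma again. That is,

$$\text{m.m.s.e.} = \left[R_x^{-1} + H^*R_v^{-1}H\right]^{-1}$$
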
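Theorem 5.1 (Linear estimator for linear models) Let $\{\mathbf{y}, \mathbf{x}, \mathbf{v}\}$ be zero-mean
random variables that are related via the linear model $\mathbf{y} = H\mathbf{x} + \mathbf{v}$, for some
data matrix $H$ of compatible dimensions. Both $\mathbf{x}$ and $\mathbf{v}$ are assumed uncorrelated
with invertible covariance matrices, $R_v = E\,\mathbf{v}\mathbf{v}^*$ and $R_x = E\,\mathbf{x}\mathbf{x}^*$. The
linear least-mean-squares estimator of $\mathbf{x}$ given $\mathbf{y}$ can be evaluated by either
expression:

$$\hat{\mathbf{x}} = R_xH^*\left[R_v + HR_xH^*\right]^{-1}\mathbf{y} = \left[R_x^{-1} + H^*R_v^{-1}H\right]^{-1}H^*R_v^{-1}\mathbf{y}$$

and the resulting minimum mean-square error matrix is

$$\text{m.m.s.e.} = \left[R_x^{-1} + H^*R_v^{-1}H\right]^{-1}$$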
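The equivalence of the two expressions in the theorem, and the m.m.s.e. formula, can be checked numerically. The sketch below is only an illustration: the dimensions, the random seed, and the helper `random_pd` are assumptions, and real-valued data are used so that the conjugate transpose reduces to a transpose.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n = 5, 3   # hypothetical dimensions: H is N x n

def random_pd(size):
    """Random symmetric positive-definite matrix (illustrative helper)."""
    A = rng.standard_normal((size, size))
    return A @ A.T + size * np.eye(size)

H = rng.standard_normal((N, n))
R_x, R_v = random_pd(n), random_pd(N)

# Form (5.3): gain built from the N x N matrix R_v + H R_x H^*.
K1 = R_x @ H.T @ np.linalg.inv(R_v + H @ R_x @ H.T)
# Form (5.5): gain built from the n x n matrix R_x^{-1} + H^* R_v^{-1} H.
P = np.linalg.inv(np.linalg.inv(R_x) + H.T @ np.linalg.inv(R_v) @ H)
K2 = P @ H.T @ np.linalg.inv(R_v)

# m.m.s.e. computed directly as R_x - K R_y K^*; it should equal P.
R_y = H @ R_x @ H.T + R_v
mmse = R_x - K1 @ R_y @ K1.T
print(np.allclose(K1, K2), np.allclose(mmse, P))
```

Form (5.3) inverts an $N \times N$ matrix while form (5.5) inverts $n \times n$ matrices, so the cheaper choice depends on which dimension is smaller.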


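Remark 5.1 (Centering for linear models) If the variables $\{\mathbf{x}, \mathbf{v}\}$ in (5.1) were not zero-mean,
say $E\,\mathbf{x} = \bar{x}$ and $E\,\mathbf{v} = \bar{v}$, then the above results will still hold with $\mathbf{x}$ replaced by $(\mathbf{x} - \bar{x})$ and $\mathbf{y}$
replaced by $(\mathbf{y} - \bar{y}) = (\mathbf{y} - H\bar{x} - \bar{v})$. Indeed, the covariance matrices $\{R_x, R_v, R_y\}$ will need to
be defined accordingly as

$$\begin{aligned} R_x &= E(\mathbf{x}-\bar{x})(\mathbf{x}-\bar{x})^*, & R_v &= E(\mathbf{v}-\bar{v})(\mathbf{v}-\bar{v})^* \\ R_y &= E(\mathbf{y}-\bar{y})(\mathbf{y}-\bar{y})^*, & R_{xy} &= E(\mathbf{x}-\bar{x})(\mathbf{y}-\bar{y})^* \end{aligned}$$

and the uncorrelatedness of $\{\mathbf{x}, \mathbf{v}\}$ will now amount to requiring $E(\mathbf{x}-\bar{x})(\mathbf{v}-\bar{v})^* = 0$. Under
these conditions, it will still hold that $R_{xy} = R_xH^*$ and $R_y = R_v + HR_xH^*$, and the expressions
for $\hat{\mathbf{x}}$ will become (cf. (4.13)):

$$\begin{aligned} \hat{\mathbf{x}} &= \bar{x} + \left[R_x^{-1} + H^*R_v^{-1}H\right]^{-1}H^*R_v^{-1}(\mathbf{y} - H\bar{x} - \bar{v}) \\ &= \bar{x} + R_xH^*\left[HR_xH^* + R_v\right]^{-1}(\mathbf{y} - H\bar{x} - \bar{v}) \end{aligned}$$

◇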


We now illustrate the application of Thm. 5.1 to several important examples including 
channel estimation, channel equalization, and block data estimation. 


5.2 APPLICATION: CHANNEL ESTIMATION 


Consider an FIR channel whose tap vector c is unknown; it is modeled as a zero-mean 
random vector with a known covariance matrix, $R_c = E\,\mathbf{c}\mathbf{c}^*$. The following experiment is
performed with the purpose of estimating c, assumed of length M. The channel is assumed 
initially at rest (i.e., no initial conditions in its delay elements) and a known input sequence 
{s(i)}, also called a training sequence, is applied to the channel. The resulting output 
sequence {z(i)} is measured in the presence of additive noise, v(i), as shown in Fig. 5.1. 
The available measurements are 


y(i) = z(i) + v(i) (5.6) 
where v(i) is a zero-mean noise sequence that is uncorrelated with c. 


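Assume we collect a block of measurements $\{s(\cdot), \mathbf{y}(\cdot)\}$ over the interval $0 \leq i \leq N$.
Then we can write in matrix form, say, for $M = 3$ and $N = 6$,

$$\underbrace{\begin{bmatrix} \mathbf{y}(0) \\ \mathbf{y}(1) \\ \mathbf{y}(2) \\ \mathbf{y}(3) \\ \mathbf{y}(4) \\ \mathbf{y}(5) \\ \mathbf{y}(6) \end{bmatrix}}_{\mathbf{y}:\,(N+1)\times 1} = \underbrace{\begin{bmatrix} s(0) & & \\ s(1) & s(0) & \\ s(2) & s(1) & s(0) \\ s(3) & s(2) & s(1) \\ s(4) & s(3) & s(2) \\ s(5) & s(4) & s(3) \\ s(6) & s(5) & s(4) \end{bmatrix}}_{H:\,(N+1)\times M} \mathbf{c} + \underbrace{\begin{bmatrix} \mathbf{v}(0) \\ \mathbf{v}(1) \\ \mathbf{v}(2) \\ \mathbf{v}(3) \\ \mathbf{v}(4) \\ \mathbf{v}(5) \\ \mathbf{v}(6) \end{bmatrix}}_{\mathbf{v}:\,(N+1)\times 1} \qquad (5.7)$$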


where we are further defining the quantities $\{\mathbf{y}, H, \mathbf{v}\}$. Note that the data matrix $H$ has a
rectangular Toeplitz structure, i.e., it has constant entries along its diagonals. In addition, 
each row of H amounts to a state vector (also called a regressor) of the FIR channel. 


FIGURE 5.1 Channel estimation in the presence of additive noise.


Specifically, the $i$-th row of $H$ has the form

$$\begin{bmatrix} s(i) & s(i-1) & \ldots & s(i-M+1) \end{bmatrix}$$

which contains the input at time $i$, $s(i)$, as well as the outputs of all delay elements in
the channel. Thus, it is common to refer to the $i$-th row of $H$ as the state vector or the
regressor of the channel at the time instant $i$.

The quantities $\{\mathbf{y}, H\}$ so defined are both available to the designer, in addition to the
covariance matrices $\{R_c = E\,\mathbf{c}\mathbf{c}^*, R_v = E\,\mathbf{v}\mathbf{v}^*\}$ (by assumption). In particular, if the noise
sequence $\{\mathbf{v}(i)\}$ is assumed white with variance $\sigma_v^2$, then $R_v = E\,\mathbf{v}\mathbf{v}^* = \sigma_v^2I$. With this
information, we can estimate the channel as follows. Since we have a linear model relating
$\mathbf{y}$ to $\mathbf{c}$, as indicated by (5.7), then according to Thm. 5.1, the optimal linear estimator for $\mathbf{c}$
can be obtained from either expression:

$$\hat{\mathbf{c}} = R_cH^*\left[HR_cH^* + R_v\right]^{-1}\mathbf{y} = \left[R_c^{-1} + H^*R_v^{-1}H\right]^{-1}H^*R_v^{-1}\mathbf{y} \qquad (5.8)$$
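The construction of the Toeplitz data matrix in (5.7) and the estimate (5.8) can be sketched as follows. This is an illustrative sketch, not a prescribed implementation: the helper names `training_matrix` and `estimate_channel`, the training symbols, the tap values, and the noise level are all hypothetical.

```python
import numpy as np

def training_matrix(s, M):
    """Build the (N+1) x M Toeplitz data matrix of (5.7) from s(0..N).

    Row i is [s(i), s(i-1), ..., s(i-M+1)], with s(k) = 0 for k < 0
    because the channel is initially at rest.
    """
    s = np.asarray(s, dtype=float)
    H = np.zeros((len(s), M))
    for k in range(M):
        H[k:, k] = s[:len(s) - k]
    return H

def estimate_channel(y, H, R_c, R_v):
    """Second form of (5.8): [R_c^{-1} + H^* R_v^{-1} H]^{-1} H^* R_v^{-1} y."""
    Rv_inv = np.linalg.inv(R_v)
    A = np.linalg.inv(R_c) + H.conj().T @ Rv_inv @ H
    return np.linalg.solve(A, H.conj().T @ Rv_inv @ y)

# Hypothetical training data: M = 3 taps, N = 6, white noise of variance 0.01.
rng = np.random.default_rng(1)
s = rng.choice([-1.0, 1.0], size=7)
c_true = np.array([1.0, 0.5, -0.2])
H = training_matrix(s, M=3)
y = H @ c_true + 0.1 * rng.standard_normal(7)
c_hat = estimate_channel(y, H, R_c=np.eye(3), R_v=0.01 * np.eye(7))
print(c_hat)
```

The second form of (5.8) is used here because it requires solving only an $M \times M$ system, which is the smaller dimension when the training block is longer than the channel.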


5.3 APPLICATION: BLOCK DATA ESTIMATION 


Our second application is in the context of data (or symbol) recovery. We consider the 
same FIR channel as in Fig. 5.1, except that now we assume that its tap vector is known. 
For example, it could have been estimated via a prior training procedure as explained in 
the previous section. We denote this tap vector by c, with individual entries 


$$c \triangleq \mathrm{col}\{c(0), c(1), \ldots, c(M-1)\}$$

The channel is initially at rest and its output sequence, $\{\mathbf{z}(i)\}$, is again measured in the
presence of additive noise, $\mathbf{v}(i)$, as shown in Fig. 5.2. The signals $\{\mathbf{v}(\cdot), \mathbf{s}(\cdot)\}$ are assumed
uncorrelated. What we would like to estimate now are the symbols $\{\mathbf{s}(\cdot)\}$ that are being
transmitted through the channel. Observe that, to be consistent with our notation, since $c$
is deterministic and $\mathbf{s}(\cdot)$ is random, we are now using a boldface letter for the latter and a
normal font for the former.

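FIGURE 5.2 Block data estimation in the presence of additive noise.

Suppose we collect a block of measurements $\{\mathbf{y}(\cdot)\}$, say, $(N + 1)$ measurements over
the interval $0 \leq i \leq N$. Rather than relate the $\{\mathbf{y}(\cdot)\}$ to the channel tap vector $c$ through
a data matrix, as we did in (5.7), we now relate the $\{\mathbf{y}(\cdot)\}$ to the $\{\mathbf{s}(\cdot)\}$ through a channel
matrix. More specifically, assume again that $M = 3$ and $N = 6$ for illustration purposes.
Then we can write

$$\underbrace{\begin{bmatrix} \mathbf{y}(0) \\ \mathbf{y}(1) \\ \mathbf{y}(2) \\ \mathbf{y}(3) \\ \mathbf{y}(4) \\ \mathbf{y}(5) \\ \mathbf{y}(6) \end{bmatrix}}_{\mathbf{y}:\,(N+1)\times 1} = \underbrace{\begin{bmatrix} c(0) & & & & & & \\ c(1) & c(0) & & & & & \\ c(2) & c(1) & c(0) & & & & \\ & c(2) & c(1) & c(0) & & & \\ & & c(2) & c(1) & c(0) & & \\ & & & c(2) & c(1) & c(0) & \\ & & & & c(2) & c(1) & c(0) \end{bmatrix}}_{H:\,(N+1)\times(N+1)} \underbrace{\begin{bmatrix} \mathbf{s}(0) \\ \mathbf{s}(1) \\ \mathbf{s}(2) \\ \mathbf{s}(3) \\ \mathbf{s}(4) \\ \mathbf{s}(5) \\ \mathbf{s}(6) \end{bmatrix}}_{\mathbf{s}:\,(N+1)\times 1} + \underbrace{\begin{bmatrix} \mathbf{v}(0) \\ \mathbf{v}(1) \\ \mathbf{v}(2) \\ \mathbf{v}(3) \\ \mathbf{v}(4) \\ \mathbf{v}(5) \\ \mathbf{v}(6) \end{bmatrix}}_{\mathbf{v}:\,(N+1)\times 1}$$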

Note that the channel matrix $H$ is now square Toeplitz of size $(N + 1) \times (N + 1)$; it also
has a banded structure. The quantities $\{\mathbf{y}, H\}$ so defined are available to the designer, in
addition to the covariance matrices $\{R_s, R_v\}$ (by assumption). In particular, if the data and
noise sequences $\{\mathbf{s}(\cdot), \mathbf{v}(\cdot)\}$ are white with variances $\{\sigma_s^2, \sigma_v^2\}$, then $R_s = E\,\mathbf{s}\mathbf{s}^* = \sigma_s^2I$
and $R_v = E\,\mathbf{v}\mathbf{v}^* = \sigma_v^2I$. With this information, we can estimate the symbols in the vector
$\mathbf{s}$ as follows. Observe again that we have a linear model relating $\mathbf{y}$ to the unknown symbol
vector $\mathbf{s}$. According to Thm. 5.1, the optimal linear estimator for $\mathbf{s}$ can then be found from
either expression:

$$\hat{\mathbf{s}} = R_sH^*\left[HR_sH^* + R_v\right]^{-1}\mathbf{y} = \left[R_s^{-1} + H^*R_v^{-1}H\right]^{-1}H^*R_v^{-1}\mathbf{y} \qquad (5.9)$$
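A short sketch of block data estimation with white data and noise follows. The helper names `channel_matrix` and `estimate_symbols`, the channel taps, and the symbol values are hypothetical; the block is noise-free and the noise variance is set to a very small value only to keep the check simple.

```python
import numpy as np

def channel_matrix(c, N):
    """Lower-triangular banded Toeplitz H of size (N+1) x (N+1).

    Entry (i, j) equals c(i - j) for 0 <= i - j < M and zero otherwise,
    so that (H s)(i) = sum_k c(k) s(i - k) for a channel initially at rest.
    """
    c = np.asarray(c, dtype=float)
    H = np.zeros((N + 1, N + 1))
    for k, ck in enumerate(c):
        H += ck * np.eye(N + 1, k=-k)
    return H

def estimate_symbols(y, H, sigma_s2, sigma_v2):
    """First form of (5.9) with R_s = sigma_s2 * I and R_v = sigma_v2 * I."""
    R_y = sigma_s2 * H @ H.T + sigma_v2 * np.eye(H.shape[0])
    return sigma_s2 * H.T @ np.linalg.solve(R_y, y)

# Hypothetical channel and symbols, for illustration only.
c = [1.0, 0.5, 0.25]
H = channel_matrix(c, N=6)
s = np.array([1.0, -1.0, 1.0, 1.0, -1.0, -1.0, 1.0])
y = H @ s                      # noise-free block, to keep the check simple
s_hat = estimate_symbols(y, H, sigma_s2=1.0, sigma_v2=1e-6)
print(np.round(s_hat, 3))
```

With negligible noise the estimator essentially inverts the channel matrix; for larger noise variances the estimate shrinks toward zero, which is the usual behavior of a linear least-mean-squares solution.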


5.4 APPLICATION: LINEAR CHANNEL EQUALIZATION 


Our third application is in the context of linear channel equalization. More specifically, we 
generalize the discussion of Ex. 4.1. 

Consider again an FIR channel as shown in Fig. 5.2, with a known tap vector $c$ of length
$M$. Data symbols $\{\mathbf{s}(\cdot)\}$ are transmitted through the channel and the output sequence,
$\{\mathbf{z}(i)\}$, is measured in the presence of additive noise, $\mathbf{v}(i)$. The signals $\{\mathbf{v}(\cdot), \mathbf{s}(\cdot)\}$ are
assumed uncorrelated. Due to channel memory, each measurement $\mathbf{y}(i)$ contains contributions
not only from $\mathbf{s}(i)$ but also from prior symbols since

$$\mathbf{y}(i) = c(0)\mathbf{s}(i) + \underbrace{\sum_{k=1}^{M-1} c(k)\mathbf{s}(i-k)}_{\text{ISI}} + \mathbf{v}(i) \qquad (5.10)$$


The second term on the right-hand side is termed inter-symbol-interference (ISI); it refers
to the interference that is caused by prior symbols in $\mathbf{y}(i)$. The purpose of an equalizer is
to recover $\mathbf{s}(i)$. To achieve this task, an equalizer does not only rely on the most recent
measurement $\mathbf{y}(i)$, but it also employs several prior measurements $\{\mathbf{y}(i-k)\}$, say, for
$k = 1, 2, \ldots, L-1$. These prior measurements contain information that is correlated with
the ISI term in $\mathbf{y}(i)$. For example, the expression for $\mathbf{y}(i-1)$ is

$$\mathbf{y}(i-1) = c(0)\mathbf{s}(i-1) + \underbrace{\sum_{k=1}^{M-1} c(k)\mathbf{s}(i-1-k)}_{\text{ISI}} + \mathbf{v}(i-1)$$

and the ISI term in it shares several data symbols with the ISI term in $\mathbf{y}(i)$. It is for this
reason that prior measurements are useful in eliminating (or reducing) ISI. In this section,


we are interested in an equalizer structure of the form shown in Fig. 5.3. The equalizer is
chosen to have an FIR structure with $L$ coefficients, so that for each time instant $i$, it would
employ the $L$ observations

$$\mathbf{y}_i \triangleq \begin{bmatrix} \mathbf{y}(i) \\ \mathbf{y}(i-1) \\ \mathbf{y}(i-2) \\ \vdots \\ \mathbf{y}(i-L+1) \end{bmatrix}$$

in order to estimate $\mathbf{s}(i - \Delta)$ for some integer delay $\Delta \geq 0$. We remark that we shall


frequently deal with time sequences in this book. So when we write $\mathbf{y}(i)$ we are referring
to the value of the time sequence $\mathbf{y}(\cdot)$ at time $i$. Not only that, but the notation $\mathbf{y}(\cdot)$,
with parentheses, also means that $\mathbf{y}(\cdot)$ is a scalar. This is because for vector-valued
time-sequences, we shall instead write $\mathbf{y}_i$, with a subscript rather than parentheses, to refer to
its value at time $i$. In other words, we shall use parentheses for time-indexing in the scalar
case, e.g., $\{\mathbf{y}(i)\}$, and subscripts in the vector case, e.g., $\{\mathbf{y}_i\}$.

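It is useful to express the observation vector $\mathbf{y}_i$ in terms of the transmitted data as
follows. Assume for the sake of illustration that $L = 5$ (i.e., an equalizer with five taps) and
$M = 4$ (a channel with four taps). Then we have

$$\underbrace{\begin{bmatrix} \mathbf{y}(i) \\ \mathbf{y}(i-1) \\ \mathbf{y}(i-2) \\ \mathbf{y}(i-3) \\ \mathbf{y}(i-4) \end{bmatrix}}_{\mathbf{y}_i:\,L\times 1} = \underbrace{\begin{bmatrix} c(0) & c(1) & c(2) & c(3) & & & & \\ & c(0) & c(1) & c(2) & c(3) & & & \\ & & c(0) & c(1) & c(2) & c(3) & & \\ & & & c(0) & c(1) & c(2) & c(3) & \\ & & & & c(0) & c(1) & c(2) & c(3) \end{bmatrix}}_{H:\,L\times(L+M-1)} \underbrace{\begin{bmatrix} \mathbf{s}(i) \\ \mathbf{s}(i-1) \\ \mathbf{s}(i-2) \\ \mathbf{s}(i-3) \\ \mathbf{s}(i-4) \\ \mathbf{s}(i-5) \\ \mathbf{s}(i-6) \\ \mathbf{s}(i-7) \end{bmatrix}}_{\mathbf{s}_i:\,(L+M-1)\times 1} + \underbrace{\begin{bmatrix} \mathbf{v}(i) \\ \mathbf{v}(i-1) \\ \mathbf{v}(i-2) \\ \mathbf{v}(i-3) \\ \mathbf{v}(i-4) \end{bmatrix}}_{\mathbf{v}_i:\,L\times 1}$$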

Note that the observation vector $\mathbf{y}_i$ has $L$ entries, the data vector $\mathbf{s}_i$ has $L + M - 1$ entries,
and the channel matrix is now $L \times (L + M - 1)$. We thus find that there is a linear relation
between the vectors $\{\mathbf{y}_i, \mathbf{s}_i\}$ and this relation can be used to evaluate the covariance and
cross-covariance quantities that are needed to estimate $\mathbf{s}(i - \Delta)$ from $\mathbf{y}_i$ in the linear
least-mean-squares sense. Specifically, let us write

$$\hat{\mathbf{s}}(i - \Delta) = w^*\mathbf{y}_i \qquad (5.11)$$

for some column vector $w$ to be determined. We denote the optimal choice for $w$ by $w^o$; the
entries of $w^{o*}$ will correspond to the optimal tap coefficients for the equalizer. According
to Thm. 3.1, $w^{o*}$ is given by

$$w^{o*} = R_{sy}R_y^{-1} \qquad (5.12)$$

FIGURE 5.3 Linear equalization of an FIR channel in the presence of additive noise.


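where

$$R_{sy} \triangleq E\,\mathbf{s}(i-\Delta)\mathbf{y}_i^* \quad (1 \times L) \qquad (5.13)$$

denotes the cross-covariance vector between $\mathbf{s}(i - \Delta)$ and $\mathbf{y}_i$, and

$$R_y \triangleq E\,\mathbf{y}_i\mathbf{y}_i^* \quad (L \times L) \qquad (5.14)$$

denotes the covariance matrix of the observation vector, $\mathbf{y}_i$. Observe that since the
processes $\{\mathbf{s}(\cdot), \mathbf{y}(\cdot)\}$ are jointly wide-sense stationary, the quantities $\{R_{sy}, R_y\}$ are independent
of $i$. In order to determine $\{R_{sy}, R_y\}$ we resort to the aforementioned linear model
relating $\{\mathbf{y}_i, \mathbf{s}_i, \mathbf{v}_i, H\}$. To begin with,

$$R_y = E(H\mathbf{s}_i + \mathbf{v}_i)(H\mathbf{s}_i + \mathbf{v}_i)^* = HR_sH^* + R_v$$

where

$$R_s \triangleq E\,\mathbf{s}_i\mathbf{s}_i^* \quad ((L+M-1) \times (L+M-1)) \quad \text{and} \quad R_v \triangleq E\,\mathbf{v}_i\mathbf{v}_i^* \quad (L \times L)$$

and

$$R_{sy} = E\,\mathbf{s}(i-\Delta)(H\mathbf{s}_i + \mathbf{v}_i)^* = \left(E\,\mathbf{s}(i-\Delta)\mathbf{s}_i^*\right)H^*$$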


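since $\{\mathbf{v}(\cdot), \mathbf{s}(\cdot)\}$ are uncorrelated. The value of $E\,\mathbf{s}(i-\Delta)\mathbf{s}_i^*$ depends on the assumed
correlation between the transmitted symbols. If the $\{\mathbf{s}(\cdot)\}$ are independent and identically
distributed with variance $\sigma_s^2$, then $R_s = \sigma_s^2I$ and

$$\begin{aligned} E\,\mathbf{s}(i-\Delta)\mathbf{s}_i^* &= E\,\mathbf{s}(i-\Delta)\begin{bmatrix} \mathbf{s}^*(i) & \mathbf{s}^*(i-1) & \ldots & \mathbf{s}^*(i-L-M+2) \end{bmatrix} \\ &= \begin{bmatrix} 0 & \ldots & 0 & \sigma_s^2 & 0 & \ldots & 0 \end{bmatrix} \end{aligned}$$

with $\Delta$ leading zeros. We continue with this assumption on the data $\{\mathbf{s}(\cdot)\}$ for simplicity
of presentation. But it should be clear that the above derivation applies even for correlated
(but stationary) data. In a similar vein, we assume that the noise sequence $\{\mathbf{v}(\cdot)\}$ is white
with variance $\sigma_v^2$ so that $R_v = \sigma_v^2I$. Again, the development applies even for correlated
(but stationary) noise. We therefore find that

$$R_y = \sigma_s^2HH^* + \sigma_v^2I \quad \text{and} \quad R_{sy} = \begin{bmatrix} 0 & \ldots & 0 & \sigma_s^2 & 0 & \ldots & 0 \end{bmatrix}H^* \qquad (5.15)$$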


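Observe that $R_{sy}$ is proportional to the $(\Delta + 1)$-th row of $H^*$. Substituting (5.15) into
(5.12) we arrive at the following expression for the equalizer tap vector:

$$w^{o*} = \sigma_s^2\,e_\Delta H^*\left(\sigma_s^2HH^* + \sigma_v^2I\right)^{-1} \qquad (5.16)$$

where $e_\Delta$ denotes the basis (row) vector with $\Delta$ leading zeros,

$$e_\Delta \triangleq \begin{bmatrix} 0 & \ldots & 0 & 1 & 0 & \ldots & 0 \end{bmatrix} \quad (\text{with } \Delta \text{ leading zeros})$$

Moreover, according to Thm. 3.1, the resulting minimum mean-square error is given by
m.m.s.e. $= \sigma_s^2 - R_{sy}R_y^{-1}R_{ys}$ and, hence,

$$\text{m.m.s.e.} = \sigma_s^2\left(1 - \sigma_s^2\,e_\Delta H^*R_y^{-1}He_\Delta^*\right) \qquad (5.17)$$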


An intuitive way to understand the usefulness of using a nonzero delay A is to recall 
that channels have group delays. Loosely speaking, the group delay of a channel is a mea- 
sure of the amount of delay that a signal undergoes when transmitted through the channel. 
For this reason, the channel output, $\mathbf{z}(i)$, would be more correlated with a delayed version
of $\mathbf{s}(i)$ than with $\mathbf{s}(i)$ itself. It therefore makes sense to use the channel output to estimate
a delayed replica of the input. 


Example 5.1 (Numerical illustration) 


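Let us use the above results to re-examine Ex. 4.1, where $L = 3$, $M = 2$, $\Delta = 0$, $\{c(0), c(1)\} = \{1, 0.5\}$,
and $\{\sigma_s^2, \sigma_v^2\} = \{1, 1\}$. Therefore, for this case, we have

$$H = \begin{bmatrix} 1 & 0.5 & & \\ & 1 & 0.5 & \\ & & 1 & 0.5 \end{bmatrix}$$

so that from (5.15),

$$R_y = \begin{bmatrix} 9/4 & 1/2 & 0 \\ 1/2 & 9/4 & 1/2 \\ 0 & 1/2 & 9/4 \end{bmatrix}$$

Using (5.16) and (5.17) we get

$$w^{o*} = \begin{bmatrix} 0.4688 & -0.1096 & 0.0244 \end{bmatrix} \quad \text{and} \quad \text{m.m.s.e.} = 0.5312$$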


which are the same results from Ex. 4.1. 
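The numbers in this example can be reproduced with a few lines of code. The sketch below is an illustration under the white data and white noise assumptions of (5.15); the function name `linear_equalizer` and its argument names are hypothetical, and real channel taps are assumed so that the conjugate transpose reduces to a transpose.

```python
import numpy as np

def linear_equalizer(c, L, delta, sigma_s2, sigma_v2):
    """Return (w, mmse) from (5.16)-(5.17) for an FIR channel with real taps c."""
    M = len(c)
    H = np.zeros((L, L + M - 1))
    for i in range(L):
        H[i, i:i + M] = c                       # row i holds c shifted by i
    R_y = sigma_s2 * H @ H.T + sigma_v2 * np.eye(L)
    e = np.zeros(L + M - 1)
    e[delta] = 1.0                              # basis vector with delta leading zeros
    R_sy = sigma_s2 * e @ H.T                   # cross-covariance, length L
    w = np.linalg.solve(R_y, R_sy)              # R_y is symmetric, so w^T = R_sy R_y^{-1}
    mmse = sigma_s2 - R_sy @ w
    return w, mmse

w, mmse = linear_equalizer(c=[1.0, 0.5], L=3, delta=0, sigma_s2=1.0, sigma_v2=1.0)
print(np.round(w, 4), round(float(mmse), 4))
```

Calling the same function with other values of `delta` gives a quick way to see how the m.m.s.e. changes with the equalization delay.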


5.5 APPLICATION: MULTIPLE-ANTENNA RECEIVERS 


Let us reconsider Ex. 4.1 with $N$ noisy measurements,

$$\mathbf{y}(i) = \mathbf{x} + \mathbf{v}(i), \qquad i = 0, 1, \ldots, N-1$$

of some zero-mean random variable $\mathbf{x}$. Let $\mathbf{y} = \mathrm{col}\{\mathbf{y}(0), \mathbf{y}(1), \ldots, \mathbf{y}(N-1)\}$ denote
the observation vector. In Ex. 4.1 we evaluated the linear least-mean-squares estimator of
$\mathbf{x}$ given $\mathbf{y}$ by computing $R_{xy}R_y^{-1}$ explicitly. Here we evaluate $\hat{\mathbf{x}}$ by showing first that $\mathbf{y}$
and $\mathbf{x}$ are related through a linear model as in (5.1), and then using (5.5). For generality,
we shall assume that the variances of $\mathbf{x}$ and $\mathbf{v}(i)$ are $\sigma_x^2$ and $\sigma_v^2$, respectively. In Ex. 4.1,
we used $\sigma_x^2 = \sigma_v^2 = 1$.

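Introduce the $N \times 1$ column vectors:

$$\mathbf{v} \triangleq \mathrm{col}\{\mathbf{v}(0), \mathbf{v}(1), \ldots, \mathbf{v}(N-1)\}, \qquad h \triangleq \mathrm{col}\{1, 1, \ldots, 1\}$$

Then $\mathbf{y} = h\mathbf{x} + \mathbf{v}$ and the covariance matrix of $\mathbf{v}$ is $\sigma_v^2I$. We now obtain from (5.5) that

$$\hat{\mathbf{x}} = \left[\frac{1}{\sigma_x^2} + \frac{h^*h}{\sigma_v^2}\right]^{-1}\frac{h^*\mathbf{y}}{\sigma_v^2} = \frac{1}{N + 1/\mathsf{SNR}}\sum_{i=0}^{N-1}\mathbf{y}(i)$$

where $\mathsf{SNR} = \sigma_x^2/\sigma_v^2$. Observe that we are not dividing by the number of observations
(which is $N$) but by $N + 1/\mathsf{SNR}$. We shall comment on the significance of this observation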
in the next chapter. 
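The effect of dividing by $N + 1/\mathsf{SNR}$ rather than by $N$ is easy to see numerically. The following is a small sketch in which the function name `combine_measurements` and the measurement values are hypothetical.

```python
def combine_measurements(y, snr):
    """Linear l.m.s. estimate of x from y(i) = x + v(i): sum(y) / (N + 1/SNR)."""
    return sum(y) / (len(y) + 1.0 / snr)

# Hypothetical measurements, for illustration only.
y = [0.9, 1.2, 1.1, 0.8]
for snr in (0.1, 1.0, 100.0):
    print(snr, combine_measurements(y, snr))
print("sample mean:", sum(y) / len(y))
```

At low SNR the estimate is pulled toward zero, which is the mean of $\mathbf{x}$, while at high SNR it approaches the sample mean of the measurements.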

Compared with the solution of Ex. 4.1, observe how the expression for $\hat{\mathbf{x}}$ is obtained
here more immediately. We did not explicitly form the $N \times N$ covariance matrix $R_y$ and the
$1 \times N$ cross-covariance vector $R_{xy}$, and then evaluate $R_{xy}R_y^{-1}$. Instead, we used the linear
relation $\mathbf{y} = h\mathbf{x} + \mathbf{v}$ and the formula (5.5). In this formula, the term $(R_x^{-1} + H^*R_v^{-1}H)$
is a scalar and its inversion is trivial. This is in contrast to the term $(R_v + HR_xH^*)$,
whose inverse appears in the alternative formula (5.3); this term is equal to $R_y$. Again,
if we interpret the $\{\mathbf{y}(k)\}$ as noisy measurements that are collected at multiple antennas
as a result of transmitting a signal $\mathbf{x}$ over additive noise channels, then we find that the
expression for $\hat{\mathbf{x}}$ suggests the optimal linear receiver structure shown in Fig. 5.4.

In Probs. II.25 and II.26 we extend this result further by showing how to incorporate
channel gains into the design procedure. Specifically, we pursue receiver structures that 
are optimal according to two criteria: in one criterion, the SNR at the output of the receiver 
is maximized (resulting in the so-called maximal-ratio-combining technique), while in the 
second criterion the same minimum mean-square-error design of the current example is 
used. 


FIGURE 5.4 An optimal linear receiver for recovering a symbol $\mathbf{x}$ transmitted over additive-noise
channels from multiple-antenna measurements.


CHAPTER 6 


Constrained Estimation 


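In Sec. 5.1 we studied the problem of estimating a random variable $\mathbf{x}$ from a noisy observation
$\mathbf{y}$ that is related to $\mathbf{x}$ via the linear model

$$\mathbf{y} = H\mathbf{x} + \mathbf{v} \qquad (6.1)$$

where $H$ is a known data matrix and $\mathbf{v}$ is some disturbance, with $\mathbf{x}$ and $\mathbf{v}$ satisfying

$$E\,\mathbf{x} = 0, \quad E\,\mathbf{v} = 0, \quad E\,\mathbf{x}\mathbf{x}^* = R_x, \quad E\,\mathbf{v}\mathbf{v}^* = R_v, \quad E\,\mathbf{x}\mathbf{v}^* = 0 \qquad (6.2)$$

The linear least-mean-squares estimator of $\mathbf{x}$ given $\mathbf{y}$ was found to be given by either
expression

$$\hat{\mathbf{x}} = R_xH^*\left[R_v + HR_xH^*\right]^{-1}\mathbf{y} = \left[R_x^{-1} + H^*R_v^{-1}H\right]^{-1}H^*R_v^{-1}\mathbf{y} \qquad (6.3)$$

with the right-most expression valid whenever $R_x > 0$ and $R_v > 0$.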
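In Sec. 5.5, we applied these results to a simple, yet revealing example. Given $N$
measurements $\{\mathbf{y}(0), \mathbf{y}(1), \ldots, \mathbf{y}(N-1)\}$ of a random variable $\mathbf{x}$ with variance $\sigma_x^2$,

$$\mathbf{y}(i) = \mathbf{x} + \mathbf{v}(i), \qquad i = 0, 1, \ldots, N-1$$

i.e., given

$$\begin{bmatrix} \mathbf{y}(0) \\ \vdots \\ \mathbf{y}(N-1) \end{bmatrix} = \begin{bmatrix} 1 \\ \vdots \\ 1 \end{bmatrix}\mathbf{x} + \begin{bmatrix} \mathbf{v}(0) \\ \vdots \\ \mathbf{v}(N-1) \end{bmatrix} \qquad (6.4)$$

we estimated $\mathbf{x}$ from the $\{\mathbf{y}(i)\}$ and found that

$$\hat{\mathbf{x}} = \frac{1}{N + 1/\mathsf{SNR}}\sum_{i=0}^{N-1}\mathbf{y}(i) \qquad (6.5)$$

where $\mathsf{SNR} = \sigma_x^2/\sigma_v^2$. In (6.4), the variable $\mathbf{x}$ is assumed to have been initially selected at
random and then $N$ noisy measurements of this same value are made (see Fig. 2.9). The
observations are subsequently used to estimate $\mathbf{x}$ according to (6.5).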

But what if we consider a different model for $\mathbf{x}$, whereby it is assumed to be a constant
of unknown value, say, $x$, rather than a random quantity? How will the expression for $\hat{\mathbf{x}}$
change? The purpose of this chapter is to study such estimators. Specifically, we shall now
consider linear models of the form

$$\mathbf{y} = Hx + \mathbf{v} \qquad (6.6)$$




Adaptive Filters, by Ali H. Sayed 
Copyright © 2008 John Wiley & Sons, Inc. 



where, compared with (6.1), we are replacing the boldface letter $\mathbf{x}$ by the normal letter
$x$ (remember that we reserve the boldface notation to random variables throughout this
book). The observation vector $\mathbf{y}$ in (6.6) continues to be random since the disturbance $\mathbf{v}$ is
random. Moreover, any estimator for $x$ that is based on $\mathbf{y}$ will also be a random variable
itself. Given (6.6), we shall then study the problem of designing an optimal linear estimator
for $x$ of the form $\hat{\mathbf{x}} = K\mathbf{y}$, for some $K$ to be determined. It will turn out that, for such
problems, $K$ is found by solving a constrained least-mean-squares estimation problem,
as opposed to the unconstrained estimation problem (3.21). Once the estimation problem
is solved, we shall then apply it in Secs. 6.3-6.5 to three examples: channel and noise
estimation, decision-feedback equalization, and antenna beamforming.


6.1 MINIMUM-VARIANCE UNBIASED ESTIMATION 


Thus consider a zero-mean random noise variable $\mathbf{v}$ with a positive-definite covariance
matrix $R_v = E\,\mathbf{v}\mathbf{v}^* > 0$, and let $\mathbf{y}$ be a noisy measurement of $Hx$,

$$\mathbf{y} = Hx + \mathbf{v} \qquad (6.7)$$

where $x$ is the unknown constant vector that we wish to estimate. The dimensions of the
data matrix $H$ are denoted by $N \times n$ and it is further assumed that

$$N \geq n \qquad (6.8)$$

That is, $H$ is assumed to be a tall matrix so that the number of available entries in $\mathbf{y}$ is at
least as many as the number of unknown entries in $x$. Note that we use the capital letter
$N$ for the larger dimension and the small letter $n$ for the smaller dimension. We shall use
this convention in the book whenever it is relevant to indicate how the row and column
dimensions of a matrix compare to each other.

We also assume that the matrix H in (6.7) has full rank, i.e., that all its columns are 
linearly independent and, hence, 


rank(H) = n (6.9) 


This condition guarantees that the matrix product $H^*H$ is invertible (in fact, positive-definite;
recall Lemma B.4). It also guarantees that the product $H^*R_v^{-1}H$ is positive-definite;
see expression (B.1). For the benefit of the reader, Sec. B.2 reviews several
basic concepts regarding range spaces, nullspaces, and ranks of matrices.


Problem Formulation 


We are interested in determining a linear estimator for $x$ of the form $\hat{\mathbf{x}} = K\mathbf{y}$, for some
$n \times N$ matrix $K$. The choice of $K$ should satisfy two conditions:

1. Unbiasedness. First, the estimator $\hat{\mathbf{x}}$ should be unbiased. That is, the choice of
$K$ should guarantee $E\,\hat{\mathbf{x}} = x$, which is the same as $KE\,\mathbf{y} = x$. But from (6.7) we
have $E\,\mathbf{y} = Hx$ so that $K$ should satisfy $KHx = x$, no matter what the value of $x$
is. This condition means that $K$ should satisfy

$$KH = I \qquad (6.10)$$

Note that $KH$ is $n \times n$ and is therefore a square matrix.


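2. Optimality. Second, the choice of $K$ should minimize the covariance matrix of the
estimation error, $\tilde{\mathbf{x}} = x - \hat{\mathbf{x}}$. Using the condition $KH = I$, we find that

$$\hat{\mathbf{x}} = K\mathbf{y} = K(Hx + \mathbf{v}) = KHx + K\mathbf{v} = x + K\mathbf{v}$$

so that $\tilde{\mathbf{x}} = -K\mathbf{v}$. This means that the error covariance matrix, as a function of $K$,
is given by

$$E\,\tilde{\mathbf{x}}\tilde{\mathbf{x}}^* = E(K\mathbf{v}\mathbf{v}^*K^*) = KR_vK^* \qquad (6.11)$$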

Combining (6.10) and (6.11), we conclude that the desired $K$ is found by solving the
following constrained optimization problem:

$$\min_K \; KR_vK^* \quad \text{subject to} \quad KH = I \qquad (6.12)$$

The estimator $\hat{x} = K_oy$ that results from the solution of (6.12) is known as the minimum-variance-unbiased estimator, or m.v.u.e. for short. It is also sometimes called the best
linear unbiased estimator (BLUE).


Interpretation and Solution

Let $J(K)$ denote the cost function that appears in (6.12), i.e.,

$$J(K) \triangleq KR_vK^*$$

Then problem (6.12) means the following. We seek a matrix $K_o$ satisfying $K_oH = I$ such
that

$$J(K) - J(K_o) \geq 0 \quad \text{for all } K \text{ satisfying } KH = I$$

where the inequality is understood in the matrix sense, i.e., the difference is nonnegative-definite.
There are several ways of determining $K_o$. We choose to use the already known solution
of the linear estimation problem (cf. Sec. 5.1) in order to guess what the solution $K_o$ for
(6.12) should be. Once this is done, we shall then provide an independent verification of
the result.

Thus recall, as mentioned in the introduction of this chapter, that for two zero-mean
random variables $\{x, y\}$ that are related as in (6.1), the linear least-mean-squares estimator
of $x$ given $y$ is (cf. the second expression in (6.3)):

$$\hat{x} = (R_x^{-1} + H^*R_v^{-1}H)^{-1}H^*R_v^{-1}y$$

Now assume that the covariance matrix of $x$ has the particular form $R_x = \alpha I$, with a
sufficiently large positive scalar $\alpha$ (i.e., $\alpha \to \infty$). That is, assume that the variance of each
of the entries of $x$ is infinitely large. In this way, $x$ can be "interpreted" as playing the role
of some unknown constant vector. Then the above expression for $\hat{x}$ reduces to

$$\hat{x} = (H^*R_v^{-1}H)^{-1}H^*R_v^{-1}y$$

This conclusion suggests that the choice

$$K_o = (H^*R_v^{-1}H)^{-1}H^*R_v^{-1}$$

solves the problem of estimating the unknown vector $x$ from model (6.7). We shall now
establish this result more directly; the result is known as the Gauss-Markov theorem.

Theorem 6.1 (Gauss-Markov Theorem) Consider the linear model $y = Hx + v$,
where $v$ is a zero-mean random variable with positive-definite covariance
matrix $R_v$, and $x$ is an unknown constant vector. Assume further that $H$ is
a full-rank $N \times n$ matrix with $N \geq n$. Then the minimum-variance-unbiased
linear estimator of $x$ given $y$ is $\hat{x} = K_oy$, where

$$K_o = (H^*R_v^{-1}H)^{-1}H^*R_v^{-1}$$

Moreover, the resulting cost is m.m.s.e. $= (H^*R_v^{-1}H)^{-1}$.


Proof: For any matrix $K$ that satisfies $KH = I$, it is easy to verify that

$$J(K) = KR_vK^* = (K - K_o)R_v(K - K_o)^* + K_oR_vK_o^* \qquad (6.13)$$

This is because

$$KR_vK_o^* = KR_v\left[R_v^{-1}H(H^*R_v^{-1}H)^{-1}\right] = KH(H^*R_v^{-1}H)^{-1} = (H^*R_v^{-1}H)^{-1}$$

Likewise, $K_oR_vK_o^* = (H^*R_v^{-1}H)^{-1}$. Relation (6.13) expresses the cost $J(K)$ as the sum of
two nonnegative-definite terms: one is independent of $K$ and is equal to $K_oR_vK_o^*$, while the other
is dependent on $K$. It is then clear, since $R_v > 0$, that the cost is minimized by choosing $K = K_o$,
and that the resulting minimum cost is

$$J(K_o) = (H^*R_v^{-1}H)^{-1}$$

Note further that the matrix $K_o$ in the statement of the theorem satisfies the constraint $K_oH = I$.

◊
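The statement of the theorem can be checked numerically. The following is a minimal sketch in Python with NumPy, assuming real-valued data; the matrices `H` and `Rv`, the helper name `mvue_gain`, and the perturbation `Z` are hypothetical choices made only for illustration. It verifies that $K_oH = I$, that any other unbiased gain $K = K_o + Z$ with $ZH = 0$ has a cost that exceeds $J(K_o)$ by a nonnegative-definite matrix, and that the l.l.m.s.e. gain with $R_x = \alpha I$ approaches $K_o$ for large $\alpha$.

```python
import numpy as np

def mvue_gain(H, Rv):
    """Return K_o = (H^* Rv^{-1} H)^{-1} H^* Rv^{-1} and the cost (H^* Rv^{-1} H)^{-1}."""
    Rv_inv_H = np.linalg.solve(Rv, H)             # Rv^{-1} H
    P = np.linalg.inv(H.conj().T @ Rv_inv_H)      # (H^* Rv^{-1} H)^{-1}
    return P @ Rv_inv_H.conj().T, P               # relies on Rv being Hermitian

rng = np.random.default_rng(0)
N, n = 6, 2
H = rng.standard_normal((N, n))                   # hypothetical tall data matrix
A = rng.standard_normal((N, N))
Rv = A @ A.T + N * np.eye(N)                      # hypothetical positive-definite covariance

Ko, J_min = mvue_gain(H, Rv)

# Any other unbiased gain can be written as K = Ko + Z with Z H = 0.
Z = rng.standard_normal((n, N))
Z = Z - Z @ H @ np.linalg.pinv(H)                 # remove the part of Z that acts on range(H)
K = Ko + Z
gap = K @ Rv @ K.T - J_min                        # J(K) - J(Ko)

# The l.l.m.s.e. gain with R_x = alpha * I approaches Ko as alpha grows.
alpha = 1e9
K_llmse = np.linalg.solve(np.eye(n) / alpha + H.T @ np.linalg.solve(Rv, H),
                          np.linalg.solve(Rv, H).T)

print(np.allclose(Ko @ H, np.eye(n)), np.allclose(K @ H, np.eye(n)))
print(np.linalg.eigvalsh((gap + gap.T) / 2).min() >= -1e-10)
print(np.allclose(K_llmse, Ko, atol=1e-5))
```

The perturbation is built by projecting a random matrix onto the left nullspace of `H`, which is one convenient way of generating gains that satisfy the constraint; it is not the only one.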


Remark 6.1 (Constrained optimization) Sometimes in applications (see Secs. 6.4 and 6.5), optimization problems of the form (6.12) arise without being explicitly related to a minimum-variance-unbiased estimation problem (as in the statement of Thm. 6.1). For this reason, we also state the
following conclusion here for later reference. The solution of a generic constrained optimization
problem of the form

$$\min_K \; KR_vK^* \quad \text{subject to} \quad KH = I \quad \text{and} \quad R_v > 0 \qquad (6.14)$$

is given by

$$K_o = (H^*R_v^{-1}H)^{-1}H^*R_v^{-1}$$

with the resulting minimum cost equal to

$$\text{minimum cost} = (H^*R_v^{-1}H)^{-1} \qquad (6.15)$$


6.2 EXAMPLE: MEAN ESTIMATION 


Let us reconsider the example of Sec. 5.5, where we assumed that we are given $N$ measurements

$$y(i) = x + v(i), \quad i = 0, 1, \ldots, N-1$$

of the same random variable $x$ with variance $\sigma_x^2$. The noise sequence $v(i)$ was further
assumed to be white with zero mean and variance $\sigma_v^2$. The linear least-mean-squares estimator (l.l.m.s.e.) of $x$ given the $\{y(i)\}$ was found to be (cf. (6.5)):

$$\hat{x}_{\text{llmse}} = \frac{1}{N + 1/\text{SNR}} \sum_{i=0}^{N-1} y(i)$$

where $\text{SNR} = \sigma_x^2/\sigma_v^2$.
Now assume instead that we model $x$ as an unknown constant, rather than a random
variable, say,

$$y(i) = x + v(i), \quad i = 0, 1, \ldots, N-1 \qquad (6.16)$$


In this case, the value of $x$ can be regarded as the mean value of each $y(i)$. We collect the
measurements and the noises into vector form,

$$y \triangleq \text{col}\{y(0), y(1), \ldots, y(N-1)\}, \qquad v \triangleq \text{col}\{v(0), v(1), \ldots, v(N-1)\}$$

and define the data vector

$$h \triangleq \text{col}\{1, 1, \ldots, 1\}$$

Then

$$y = hx + v$$

with $R_v = \mathsf{E}\,vv^* = \sigma_v^2 I$. Invoking the result of Thm. 6.1 with $H = h$, we conclude that
the optimal linear estimator, or the m.v.u.e., of $x$ is

$$\hat{x}_{\text{mvue}} = \frac{1}{N} \sum_{i=0}^{N-1} y(i) \qquad (6.17)$$

This result is simply the sample-mean estimator that the reader may be familiar with from
an introductory course on statistics. Comparing the expressions for $\hat{x}_{\text{llmse}}$ and $\hat{x}_{\text{mvue}}$ we see
that we are now dividing the sum of the observations $\{y(i)\}$ by $N$, and not by $N + 1/\text{SNR}$.
This modification guarantees that the estimator $\hat{x}_{\text{mvue}}$ is truly unbiased.
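The difference between the two normalizations can be seen on a small noise-free example. The sketch below is only illustrative; the function name `mean_estimators` and the chosen values ($x = 2$, $N = 4$, $\text{SNR} = 1$) are assumptions made for the demonstration.

```python
import numpy as np

def mean_estimators(y, snr):
    """Return (x_llmse, x_mvue) for measurements y(i) = x + v(i), i = 0..N-1."""
    N = len(y)
    total = np.sum(y)
    return total / (N + 1.0 / snr), total / N

# Noise-free check with x = 2 and N = 4: the sample mean recovers x exactly,
# while the l.l.m.s.e. form shrinks the estimate toward the zero prior mean.
y = np.full(4, 2.0)
x_llmse, x_mvue = mean_estimators(y, snr=1.0)
print(x_llmse, x_mvue)   # 1.6 2.0
```

With noiseless data the m.v.u.e. returns the constant itself, which is what unbiasedness requires, whereas dividing by $N + 1/\text{SNR}$ biases the result toward zero.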


6.3 APPLICATION: CHANNEL AND NOISE ESTIMATION 


We reconsider the channel estimation problem of Sec. 5.2, except that now the channel
tap vector is modeled as an unknown constant vector, $c$, rather than a random vector, $\mathbf{c}$, as
shown in Fig. 6.1.

By repeating the construction of Sec. 5.2 we again obtain (cf. (5.7)):

$$y = Hc + v$$

where we are defining the quantities $\{y, H, v\}$ and where $H$ is $(N+1) \times M$. Using the
result of Thm. 6.1, we find that the optimal estimator of $c$ is now given by

$$\hat{c}_{\text{mvue}} = (H^*R_v^{-1}H)^{-1}H^*R_v^{-1}y \qquad (6.18)$$


where $R_v$ is the covariance matrix of $v$. This result is different from the linear least-mean-squares estimator (l.l.m.s.e.) found in Sec. 5.2 (see (5.8)), namely,

$$\hat{c}_{\text{llmse}} = \left[R_c^{-1} + H^*R_v^{-1}H\right]^{-1}H^*R_v^{-1}y$$

which requires knowledge of the covariance matrix $R_c = \mathsf{E}\,\mathbf{c}\mathbf{c}^*$ when $\mathbf{c}$ is modeled as a
random variable. The estimator (6.18) requires knowledge of only $\{H, R_v, y\}$. Actually,
if the noise sequence $\{v(i)\}$ is modeled as white with variance $\sigma_v^2$, then $R_v = \sigma_v^2 I$ and
$R_v$ would end up disappearing from the expression for $\hat{c}_{\text{mvue}}$. Specifically, (6.18) would
become

$$\hat{c}_{\text{mvue}} = (H^*H)^{-1}H^*y \qquad (6.19)$$

It is worth remarking that expression (6.19) has the form of a least-squares solution, which
we shall study in great detail in Part VII (Least-Squares Methods).

Note from (6.19) that we do not need to know $\sigma_v^2$; the estimator is now only dependent
on the available data (namely, the measurements $\{y(i)\}$ and the data matrix $H$). If desired,
we can estimate $\sigma_v^2$ itself as follows. Since

$$v(i) = y(i) - \begin{bmatrix} s(i) & s(i-1) & \ldots & s(i-M+1) \end{bmatrix} c$$

an estimator for $\sigma_v^2$ would be

$$\hat{\sigma}_v^2 = \frac{1}{N+1} \sum_{i=0}^{N} \left| y(i) - \begin{bmatrix} s(i) & s(i-1) & \ldots & s(i-M+1) \end{bmatrix} \hat{c}_{\text{mvue}} \right|^2 = \frac{1}{N+1} \|y - H\hat{c}_{\text{mvue}}\|^2 \qquad (6.20)$$

Expressions (6.19) and (6.20) are often used in practice to perform channel and noise
variance estimation. Later in Prob. VII.19 we shall show that the alternative estimator

$$\hat{\sigma}_v^2 = \frac{1}{N+1-M} \|y - H\hat{c}_{\text{mvue}}\|^2$$

with $(N+1-M)$ instead of $(N+1)$, is an unbiased estimator for $\sigma_v^2$.

FIGURE 6.1 Channel and noise estimation.
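A minimal simulation of (6.19) and (6.20) is sketched below. The channel taps, the BPSK training symbols, the noise level, and the convention that $s(j) = 0$ for $j < 0$ when forming $H$ are all assumptions made for illustration; the construction of Sec. 5.2 may use different boundary conventions.

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 3, 1999                                  # M taps, N + 1 measurements
c = np.array([1.0, 0.5, -0.2])                  # hypothetical channel taps
sigma_v = 0.1                                   # hypothetical noise standard deviation

s = rng.choice([-1.0, 1.0], size=N + 1)         # hypothetical BPSK training symbols
# Row i of H is [s(i) s(i-1) ... s(i-M+1)], with s(j) = 0 for j < 0 (assumed).
s_pad = np.concatenate([np.zeros(M - 1), s])
H = np.column_stack([s_pad[M - 1 - k : M - 1 - k + N + 1] for k in range(M)])
y = H @ c + sigma_v * rng.standard_normal(N + 1)

c_hat, *_ = np.linalg.lstsq(H, y, rcond=None)   # (6.19): (H^* H)^{-1} H^* y
residual = np.sum(np.abs(y - H @ c_hat) ** 2)
var_620 = residual / (N + 1)                    # normalization used in (6.20)
var_unbiased = residual / (N + 1 - M)           # alternative normalization

print(c_hat, var_620, var_unbiased)
```

The least-squares solver is used here instead of forming $(H^*H)^{-1}$ explicitly, which is a numerical design choice and not part of the derivation.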


6.4 APPLICATION: DECISION FEEDBACK EQUALIZATION 


Our second application is in the context of channel equalization, which we already encoun- 
tered in Sec. 5.4 while studying linear equalizers. In this section, we extend the discussion 
to decision-feedback equalizers. 

Thus consider an FIR channel with a known column tap vector $c$ of length $M$ (i.e., with
$M$ taps), say, with transfer function

$$C(z) = c(0) + c(1)z^{-1} + \ldots + c(M-1)z^{-(M-1)}$$

Data symbols $\{s(\cdot)\}$ are transmitted through the channel and the output sequence is measured in the presence of additive noise, $v(i)$. The signals $\{v(\cdot), s(\cdot)\}$ are assumed uncorrelated. Due to the channel memory, each measurement $y(i)$ contains contributions not only
from $s(i)$ but also from prior symbols, since

$$y(i) = c(0)s(i) + \underbrace{\sum_{k=1}^{M-1} c(k)s(i-k)}_{\text{ISI}} + v(i)$$

The second term on the right-hand side describes the inter-symbol-interference (ISI); it
refers to the interference that is caused by prior symbols. The purpose of an equalizer is to
combat ISI and to recover $s(i)$ from measurements of the output sequence.

As was discussed in Sec. 5.4, in order to achieve this task, a linear equalizer employs
current and prior measurements $\{y(i-k)\}$, say, for $k = 0, 1, \ldots, L-1$. This is because
prior measurements contain information that is correlated with the ISI term in $y(i)$ and,
therefore, they can help in estimating the interference term and removing its effect. Of
course, if possible, it would be preferable to use the prior symbols $\{s(i-1), s(i-2), \ldots\}$
themselves in order to cancel their effect from $y(i)$ rather than rely on the prior measurements $\{y(i-1), y(i-2), \ldots\}$.

Decision-feedback equalizers (DFE) attempt to implement this strategy and are therefore better suited for channels with pronounced ISI. In addition to using an FIR filter in the
feedforward path, as in linear equalization, a DFE employs a feedback filter in order to feed
back previous decisions and use them to reduce ISI. The DFE structure is shown in Fig. 6.2
for estimating a delayed version of $s(i)$, with the transfer functions of the feedforward
and feedback filters denoted by $\{F(z), B(z)\}$, respectively. It is seen from the figure that
the input to the feedback filter comes from the output of the decision device, denoted by
$\check{s}(i-\Delta)$. The purpose of this device is to map the estimator $\hat{s}(i-\Delta)$, which is obtained
by combining the outputs of the feedforward and feedback filters, to the closest point in
the symbol constellation. Now in linear equalization, the feedforward filter reduces ISI by
attempting to force the combined system $C(z)F(z)$ to be close to

$$C(z)F(z) \approx z^{-\Delta}$$

In general, this objective is difficult to attain, especially for channels with pronounced ISI,
and $C(z)F(z)$ will have a nontrivial impulse response sequence (we say that $C(z)F(z)$
will have trailing inter-symbol interference). The purpose of the feedback filter in a DFE
implementation is to use prior decisions in order to cancel this trailing ISI.

FIGURE 6.2 A decision-feedback equalizer. It consists of a feedforward filter, a feedback filter,
and a decision device.


Equalizer Design

Assume the feedforward filter has $L$ taps and denote its transfer function by

$$F(z) = f(0) + f(1)z^{-1} + \ldots + f(L-1)z^{-(L-1)}$$

with coefficients $\{f(i)\}$. Likewise, assume the feedback filter has $Q$ taps with a transfer
function of the form

$$B(z) = -b(1)z^{-1} - b(2)z^{-2} - \ldots - b(Q)z^{-Q}$$

with coefficients denoted by $\{-b(i)\}$ for convenience. Note that this filter is strictly causal
in that it does not have a direct path from its input to its output (i.e., $b(0) = 0$). This is
because previous decisions are being fed back through $B(z)$.

The criterion for designing the equalizer coefficients $\{f(i), b(i)\}$ is, as usual, to minimize the variance of the error signal,

$$\tilde{s}(i-\Delta) = s(i-\Delta) - \hat{s}(i-\Delta)$$

In so doing, the designer expects that $\hat{s}(i-\Delta)$ will be sufficiently close to $s(i-\Delta)$ so
that the decision device would be able to map $\hat{s}(i-\Delta)$ to the correct symbol in the signal
constellation. Therefore, the $\{f(i), b(i)\}$ will be determined by solving

$$\min_{\{f(0),\, f(1),\, \ldots,\, f(L-1),\; b(1),\, b(2),\, \ldots,\, b(Q)\}} \mathsf{E}\,|\tilde{s}(i-\Delta)|^2 \qquad (6.21)$$

The presence of the decision device makes (6.21) a nonlinear optimization problem. This
is because $\hat{s}(i-\Delta)$ will be a nonlinear function of the measured data $\{y(i)\}$. In order to
facilitate the design of $\{F(z), B(z)\}$, it is customary to assume that

$$\text{The decisions } \{\check{s}(i-\Delta)\} \text{ are correct and equal to } \{s(i-\Delta)\} \qquad (6.22)$$

That is, we assume that the decision device gives correct decisions. This assumption highlights one difficulty with decision-feedback equalization. Erroneous decisions can happen
and they get fed back into the equalizer through the feedback filter. Longer feedback filters
tend to keep these errors longer within the equalizer and can cause performance degradation, especially at low signal-to-noise ratios. Still, in general, decision-feedback equalizers
tend to outperform linear equalizers (see Prob. II.40).

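To solve (6.21) we first examine the dependence of the error variance on the unknown
coefficients $\{f(i), b(i)\}$. From Fig. 6.2 we have

$$\hat{s}(i-\Delta) = \left[f(0)y(i) + f(1)y(i-1) + \ldots + f(L-1)y(i-L+1)\right] - \left[b(1)s(i-\Delta-1) + b(2)s(i-\Delta-2) + \ldots + b(Q)s(i-\Delta-Q)\right]$$

where we used assumption (6.22) to replace $\check{s}(j)$ by $s(j)$. We can rewrite this expression
more compactly in vector form as follows. We collect the coefficients of $F(z)$ into a row
vector:

$$f^* \triangleq \begin{bmatrix} f(0) & f(1) & \ldots & f(L-1) \end{bmatrix}$$

and the coefficients of $-B(z)$ into another row vector with a leading entry that is equal to
one,

$$b^* \triangleq \begin{bmatrix} 1 & b(1) & b(2) & \ldots & b(Q) \end{bmatrix}$$

We also define the following column vectors of observations and data symbols:

$$y_i \triangleq \underbrace{\begin{bmatrix} y(i) \\ y(i-1) \\ y(i-2) \\ \vdots \\ y(i-L+1) \end{bmatrix}}_{L \times 1}, \qquad s_\Delta \triangleq \underbrace{\begin{bmatrix} s(i-\Delta) \\ s(i-\Delta-1) \\ s(i-\Delta-2) \\ \vdots \\ s(i-\Delta-Q) \end{bmatrix}}_{(Q+1) \times 1} \qquad (6.23)$$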


and denote their covariances and cross-covariances by

$$R \triangleq \mathsf{E} \begin{bmatrix} s_\Delta \\ y_i \end{bmatrix} \begin{bmatrix} s_\Delta \\ y_i \end{bmatrix}^* = \begin{bmatrix} R_s & R_{sy} \\ R_{ys} & R_y \end{bmatrix}$$

where $R_s$ is $(Q+1) \times (Q+1)$ and $R_{sy}$ is $(Q+1) \times L$. We assume that the processes
$\{s(\cdot), y(\cdot)\}$ are jointly wide-sense stationary so that the quantities $\{R_{sy}, R_y, R_s\}$ are independent of $i$. We further assume that the covariance matrix $R$ is positive-definite and,
hence, invertible. The positive-definiteness of $R$ guarantees that both $R_y$ and the Schur
complement of $R$ with respect to $R_y$ are positive-definite matrices as well (see Sec. B.3),
i.e.,

$$R_y > 0 \quad \text{and} \quad R_\delta \triangleq R_s - R_{sy}R_y^{-1}R_{ys} > 0 \qquad (6.24)$$

where we are denoting the Schur complement by $R_\delta$. Hence, $\{R_y, R_\delta\}$ are also invertible.
With these definitions, the error signal $\tilde{s}(i-\Delta)$ can be written as

$$\tilde{s}(i-\Delta) = s(i-\Delta) - \hat{s}(i-\Delta) = b^*s_\Delta - f^*y_i$$

so that the optimization problem (6.21) becomes

$$\min_{f,\,b} \; \mathsf{E}\,|b^*s_\Delta - f^*y_i|^2 \qquad (6.25)$$


We shall denote the optimal vector solutions by $f_{\text{opt}}^*$ and $b_{\text{opt}}^*$. Rather than minimize the
variance of $b^*s_\Delta - f^*y_i$ simultaneously over $\{f, b\}$, we shall minimize it over one vector
at a time. Thus assume that we fix the vector $b$ and let us minimize the error variance
over $f$. To do so, we introduce the scalar $\alpha = b^*s_\Delta$ so that the error signal becomes
$\tilde{s}(i-\Delta) = \alpha - f^*y_i$. In this way, $\tilde{s}(i-\Delta)$ can be interpreted as the error that results from
estimating $\alpha$ from $y_i$ through the choice of $f$. In other words, we are reduced to solving

$$\min_f \; \mathsf{E}\,|\alpha - f^*y_i|^2 \quad \text{where} \quad \alpha = b^*s_\Delta$$

which is a standard linear least-mean-squares estimation problem. From Thm. 3.1, we
know that the optimal choice for $f$ is

$$f_{\text{opt}}^* = R_{\alpha y}R_y^{-1} = b^*R_{sy}R_y^{-1} \qquad (6.26)$$

where we used the fact that

$$R_{\alpha y} = \mathsf{E}\,\alpha y_i^* = \mathsf{E}\,b^*s_\Delta y_i^* = b^*R_{sy}$$

The resulting minimum mean-square error is, again from Thm. 3.1,

$$\begin{aligned} \text{m.m.s.e.} &\triangleq \mathsf{E}\,|\alpha - f_{\text{opt}}^*y_i|^2 \\ &= R_\alpha - R_{\alpha y}R_y^{-1}R_{y\alpha} \\ &= b^*R_sb - b^*R_{sy}R_y^{-1}R_{ys}b \\ &= b^*\left[R_s - R_{sy}R_y^{-1}R_{ys}\right]b \\ &= b^*R_\delta b \end{aligned} \qquad (6.27)$$
Substituting this expression into (6.25), we find that we now need to solve

$$\min_b \; b^*R_\delta b \qquad (6.28)$$

But recall that the leading entry of $b$ is unity, so that (6.28) is actually a constrained problem
of the form

$$\min_b \; b^*R_\delta b \quad \text{subject to} \quad b^*e_0 = 1$$

where $e_0$ is the first basis vector, of dimension $(Q+1) \times 1$,

$$e_0 \triangleq \text{col}\{1, 0, 0, \ldots, 0\}$$

Using the result stated in Remark 6.1 (with $K = b^*$, $H = e_0$, and $R_v = R_\delta$), we find that the optimal choice of $b$ is

$$b_{\text{opt}}^* = \frac{e_0^{\mathsf{T}}R_\delta^{-1}}{e_0^{\mathsf{T}}R_\delta^{-1}e_0}$$

The term that appears in the denominator is the $(0,0)$ entry of $R_\delta^{-1}$, while the term in the
numerator is the first row of $R_\delta^{-1}$. This means that the optimal vector $b_{\text{opt}}^*$ is obtained by
normalizing the first row of $R_\delta^{-1}$ to have a unit leading entry. Substituting the above expression for $b_{\text{opt}}^*$ into $b^*R_\delta b$ we find that the resulting m.m.s.e. of the original optimization
problem (6.21) is

$$\text{m.m.s.e.} = \frac{1}{e_0^{\mathsf{T}}R_\delta^{-1}e_0} \qquad (6.29)$$


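In summary, under assumption (6.22) that the decisions $\{\check{s}(i-\Delta)\}$ are correct, the
optimal coefficients $\{f(i), b(i)\}$ of the DFE can be found as follows:

$$b_{\text{opt}}^* = \frac{e_0^{\mathsf{T}}R_\delta^{-1}}{e_0^{\mathsf{T}}R_\delta^{-1}e_0}, \qquad f_{\text{opt}}^* = b_{\text{opt}}^*R_{sy}R_y^{-1} \qquad (6.30)$$

The entries of $\{f_{\text{opt}}^*, b_{\text{opt}}^*\}$ provide the desired tap coefficients $\{f(i), b(i)\}$.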


Using the Channel Model 


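The expressions (6.29)-(6.30) for $\{b_{\text{opt}}^*, f_{\text{opt}}^*, \text{m.m.s.e.}\}$ are in terms of the covariance and
cross-covariance matrices $\{R_\delta, R_{sy}, R_y\}$, which can be evaluated from the channel model
$C(z)$ and from the given statistical information about $\{s(\cdot), v(\cdot)\}$. To do so, we proceed
as in Sec. 5.4.

We first express the observation vector $y_i$ in terms of the transmitted data. Assume for
the sake of illustration that $L = 5$ (i.e., a feedforward filter with 5 taps) and $M = 4$ (a
channel with four taps). Then we can write

$$\underbrace{\begin{bmatrix} y(i) \\ y(i-1) \\ y(i-2) \\ y(i-3) \\ y(i-4) \end{bmatrix}}_{y_i:\; L \times 1} = \underbrace{\begin{bmatrix} c(0) & c(1) & c(2) & c(3) & & & & \\ & c(0) & c(1) & c(2) & c(3) & & & \\ & & c(0) & c(1) & c(2) & c(3) & & \\ & & & c(0) & c(1) & c(2) & c(3) & \\ & & & & c(0) & c(1) & c(2) & c(3) \end{bmatrix}}_{H:\; L \times (L+M-1)} \underbrace{\begin{bmatrix} s(i) \\ s(i-1) \\ s(i-2) \\ s(i-3) \\ s(i-4) \\ s(i-5) \\ s(i-6) \\ s(i-7) \end{bmatrix}}_{s_i:\; (L+M-1) \times 1} + \underbrace{\begin{bmatrix} v(i) \\ v(i-1) \\ v(i-2) \\ v(i-3) \\ v(i-4) \end{bmatrix}}_{v_i:\; L \times 1}$$

That is,

$$y_i = Hs_i + v_i$$

where, for general $(L, M)$,

$$s_i \triangleq \begin{bmatrix} s(i) \\ s(i-1) \\ s(i-2) \\ \vdots \\ s(i-L-M+2) \end{bmatrix}, \qquad v_i \triangleq \begin{bmatrix} v(i) \\ v(i-1) \\ v(i-2) \\ \vdots \\ v(i-L+1) \end{bmatrix} \qquad (6.31)$$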

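and $H$ is the $L \times (L+M-1)$ channel matrix. We therefore find that there is a linear
relation between the vectors $\{y_i, s_i\}$ and this relation can be used to evaluate $R_y$ as

$$R_y = \mathsf{E}\,(Hs_i + v_i)(Hs_i + v_i)^* = H\mathcal{R}_sH^* + R_v$$

where

$$\mathcal{R}_s \triangleq \mathsf{E}\,s_is_i^* \;\; ((L+M-1) \times (L+M-1)) \quad \text{and} \quad R_v \triangleq \mathsf{E}\,v_iv_i^* \;\; (L \times L)$$

(the calligraphic symbol distinguishes the covariance of $s_i$ from the $(Q+1) \times (Q+1)$ covariance $R_s$ of $s_\Delta$). Likewise,

$$R_{sy} = \mathsf{E}\,s_\Delta(Hs_i + v_i)^* = (\mathsf{E}\,s_\Delta s_i^*)H^*$$

since $\{v(\cdot), s(\cdot)\}$ are uncorrelated. We still need to evaluate $\mathsf{E}\,s_\Delta s_i^*$, where $\{s_\Delta, s_i\}$ are
defined by (6.23) and (6.31) in terms of the transmitted symbols. Of course, the value of
$\mathsf{E}\,s_\Delta s_i^*$ depends on the assumed correlation between the transmitted symbols.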

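It is common that $\Delta$ be chosen such that all the entries of $s_\Delta$ fall within the entries of
$s_i$. This condition requires the channel and filter lengths, as well as the delay $\Delta$, to satisfy

$$\Delta \leq L + M - Q - 2 \qquad (6.32)$$

With this condition, if the $\{s(\cdot)\}$ are assumed to be independent and identically distributed
with variance $\sigma_s^2$, then it can be seen that

$$\mathsf{E}\,s_\Delta s_i^* = \begin{bmatrix} 0 & \ldots & 0 & \sigma_s^2 I_{Q+1} & 0 & \ldots & 0 \end{bmatrix} \quad ((Q+1) \times (L+M-1))$$

with $\Delta$ leading zero columns, followed by a $(Q+1) \times (Q+1)$ identity matrix scaled by
$\sigma_s^2$, followed (or not) by zero columns. Even if $\Delta$ exceeds the bound in (6.32), we can still
evaluate $\mathsf{E}\,s_\Delta s_i^*$ and complete the calculations. We can express the above $\mathsf{E}\,s_\Delta s_i^*$ more
compactly as

$$\mathsf{E}\,s_\Delta s_i^* = \begin{bmatrix} 0_{(Q+1) \times \Delta} & \sigma_s^2 I_{Q+1} & 0 \end{bmatrix}$$

Likewise,

$$R_s = \sigma_s^2 I_{Q+1} \quad \text{and} \quad \mathcal{R}_s = \mathsf{E}\,s_is_i^* = \sigma_s^2 I_{L+M-1}$$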


We continue with the assumption of i.i.d. symbols $\{s(\cdot)\}$ for simplicity of presentation.
But it should be noted that the derivation applies even for correlated (but stationary) data.
In a similar vein, we assume that the noise sequence $\{v(\cdot)\}$ is white with variance $\sigma_v^2$ so
that $R_v = \sigma_v^2 I$. Again, the development applies even for correlated (but stationary) noise.
We thus find that

$$R_y = \sigma_s^2 HH^* + \sigma_v^2 I_L \quad \text{and} \quad R_{sy} = \begin{bmatrix} 0_{(Q+1) \times \Delta} & \sigma_s^2 I_{Q+1} & 0 \end{bmatrix} H^* \qquad (6.33)$$

and

$$R_\delta = \sigma_s^2 I - R_{sy}\left(\sigma_s^2 HH^* + \sigma_v^2 I_L\right)^{-1}R_{ys}$$

This latter expression can be rewritten, by virtue of the matrix inversion lemma (5.4), as

$$R_\delta = \Phi \left( \frac{I}{\sigma_s^2} + \frac{H^*H}{\sigma_v^2} \right)^{-1} \Phi^* \qquad (6.34)$$

where

$$\Phi \triangleq \begin{bmatrix} 0_{(Q+1) \times \Delta} & I_{Q+1} & 0 \end{bmatrix}$$

Expressions (6.33) and (6.34) can now be used with (6.29)-(6.30) to determine the optimal
equalizer coefficients and the resulting m.m.s.e.
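The equivalence between the direct Schur-complement form of $R_\delta$ and the form (6.34) can be confirmed numerically. In the sketch below, the dimensions, the variances, and the randomly generated real matrix `H` are hypothetical; the identity does not depend on the banded structure of the channel matrix, so a generic matrix of the right size is used.

```python
import numpy as np

rng = np.random.default_rng(3)
L, M, Q, delta = 4, 3, 2, 1                      # hypothetical sizes with delta <= L + M - Q - 2
var_s, var_v = 2.0, 0.5                          # hypothetical symbol and noise variances

H = rng.standard_normal((L, L + M - 1))          # generic real L x (L+M-1) matrix
Phi = np.zeros((Q + 1, L + M - 1))
Phi[:, delta:delta + Q + 1] = np.eye(Q + 1)      # [0  I  0] with delta leading zero columns

Ry = var_s * H @ H.T + var_v * np.eye(L)         # (6.33)
Rsy = var_s * Phi @ H.T

# Direct form: R_delta = sigma_s^2 I - R_sy R_y^{-1} R_ys
R_direct = var_s * np.eye(Q + 1) - Rsy @ np.linalg.solve(Ry, Rsy.T)
# Form (6.34), obtained through the matrix inversion lemma
R_lemma = Phi @ np.linalg.inv(np.eye(L + M - 1) / var_s + H.T @ H / var_v) @ Phi.T

print(np.allclose(R_direct, R_lemma))
```

The form (6.34) inverts an $(L+M-1) \times (L+M-1)$ matrix while the direct form inverts an $L \times L$ matrix; which one is preferable depends on the problem sizes.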


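Example 6.1 (Numerical illustration)

Let us reconsider Ex. 4.1 and design a DFE equalizer rather than a linear equalizer for the channel
$C(z) = 1 + 0.5z^{-1}$, for which $M = 2$. We select a feedforward filter with 3 taps (i.e., $L = 3$) and
a feedback filter with one tap (i.e., $Q = 1$). We also select $\Delta = 1$. The resulting structure is shown
in Fig. 6.3. For this example, $\sigma_s^2 = 1$, $\sigma_v^2 = 1$, and

$$H = \begin{bmatrix} 1 & 0.5 & 0 & 0 \\ 0 & 1 & 0.5 & 0 \\ 0 & 0 & 1 & 0.5 \end{bmatrix}$$

FIGURE 6.3 A DFE structure for the channel $1 + 0.5z^{-1}$.

so that from (6.33) and (6.34)

$$R_y = \begin{bmatrix} 9/4 & 1/2 & 0 \\ 1/2 & 9/4 & 1/2 \\ 0 & 1/2 & 9/4 \end{bmatrix}, \qquad R_{sy} = \begin{bmatrix} 0 & I_2 & 0 \end{bmatrix} H^* = \begin{bmatrix} 1/2 & 1 & 0 \\ 0 & 1/2 & 1 \end{bmatrix}, \qquad R_\delta \approx \begin{bmatrix} 0.4992 & -0.1218 \\ -0.1218 & 0.5175 \end{bmatrix}$$

Using (6.30) we obtain

$$b_{\text{opt}}^* = \begin{bmatrix} 1.0000 & 0.2354 \end{bmatrix}, \qquad f_{\text{opt}}^* = \begin{bmatrix} 0.1176 & 0.4706 & 0.0000 \end{bmatrix}$$

and the resulting m.m.s.e. is

$$\text{m.m.s.e.} = b_{\text{opt}}^*R_\delta b_{\text{opt}} = 0.4705$$

That is,

$$B_{\text{opt}}(z) = -0.2354z^{-1} \quad \text{and} \quad F_{\text{opt}}(z) = 0.1176 + 0.4706z^{-1}$$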


Had we selected instead $\Delta = 0$, as we did in Ex. 4.1, then the only quantity that changes is the
cross-covariance $R_{sy}$, which becomes

$$R_{sy} = \begin{bmatrix} 1 & 0 & 0 \\ 0.5 & 1 & 0 \end{bmatrix} \quad \Longrightarrow \quad R_\delta = \begin{bmatrix} 0.5312 & -0.1248 \\ -0.1248 & 0.4992 \end{bmatrix}$$

and, therefore,

$$b_{\text{opt}}^* = \begin{bmatrix} 1 & 0.2500 \end{bmatrix}, \qquad f_{\text{opt}}^* = \begin{bmatrix} 0.5000 & 0.0000 & 0.0000 \end{bmatrix}$$

That is, $B_{\text{opt}}(z) = -0.25z^{-1}$ and $F_{\text{opt}}(z) = 0.5$. The resulting m.m.s.e. in this case is

$$\text{m.m.s.e.} = b_{\text{opt}}^*R_\delta b_{\text{opt}} = 0.5000$$

We see that for this example with $\Delta = 0$, the DFE results in a smaller mean-square error than the
linear equalizer designed in Ex. 4.1, which resulted in m.m.s.e. $= 0.5312$. A more noticeable difference in performance between decision-feedback equalizers and linear equalizers can be observed for
channels with more pronounced inter-symbol interference. The performance of DFEs is examined
in greater detail in a computer project at the end of this part.

◊
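The numbers in this example can be reproduced with a few lines of code. The sketch below assumes a real-valued channel, i.i.d. symbols, and white noise, as in the example; the function name `dfe_design` and its argument names are hypothetical and are introduced only for illustration.

```python
import numpy as np

def dfe_design(c, L, Q, delta, var_s=1.0, var_v=1.0):
    """DFE taps from (6.29)-(6.30), assuming i.i.d. symbols, white noise, real data."""
    M = len(c)
    H = np.zeros((L, L + M - 1))
    for row in range(L):
        H[row, row:row + M] = c                  # banded channel matrix
    Phi = np.zeros((Q + 1, L + M - 1))
    Phi[:, delta:delta + Q + 1] = np.eye(Q + 1)  # [0  I  0], delta leading zero columns
    Ry = var_s * H @ H.T + var_v * np.eye(L)     # (6.33)
    Rsy = var_s * Phi @ H.T
    Rdelta = var_s * np.eye(Q + 1) - Rsy @ np.linalg.solve(Ry, Rsy.T)
    first_row = np.linalg.inv(Rdelta)[0]         # e_0^T Rdelta^{-1}
    b = first_row / first_row[0]                 # normalize to a unit leading entry
    f = b @ Rsy @ np.linalg.inv(Ry)              # f_opt^* = b_opt^* R_sy R_y^{-1}
    return b, f, 1.0 / first_row[0]              # m.m.s.e. from (6.29)

channel = np.array([1.0, 0.5])                   # C(z) = 1 + 0.5 z^{-1}
b1, f1, mmse1 = dfe_design(channel, L=3, Q=1, delta=1)
b0, f0, mmse0 = dfe_design(channel, L=3, Q=1, delta=0)
print(np.round(b1, 4), np.round(f1, 4), round(mmse1, 4))
print(np.round(b0, 4), np.round(f0, 4), round(mmse0, 4))
```

The function requires $\Delta \leq L + M - Q - 2$ so that the identity block fits inside `Phi`; other delays would need the cross-covariance to be built differently.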


Formulation as Linear Estimation

The design of a DFE can alternatively be pursued by formulating an unconstrained linear
estimation problem, of the same form studied in Thm. 3.1, as opposed to splitting its solution into two steps: unconstrained estimation for determining the feedforward filter and
constrained estimation for determining the feedback filter, as was done above.

To see this, we introduce the extended vector

$$r \triangleq \underbrace{\begin{bmatrix} s(i-\Delta-1) \\ s(i-\Delta-2) \\ \vdots \\ s(i-\Delta-Q) \\ y(i) \\ y(i-1) \\ \vdots \\ y(i-L+1) \end{bmatrix}}_{(L+Q) \times 1}$$

which, under the assumption (6.22) of correct decisions, contains all the observations that
are used by the feedforward and feedback filters in order to estimate the variable $x = s(i-\Delta)$. We also define the $1 \times (L+Q)$ row vector

$$k^* \triangleq \begin{bmatrix} -b(1) & -b(2) & \ldots & -b(Q) & f(0) & f(1) & \ldots & f(L-1) \end{bmatrix}$$

which contains the equalizer coefficients $\{f(i), b(i)\}$ that we wish to determine. In this
way,

$$\hat{s}(i-\Delta) = k^*r$$

and the estimation error, $\tilde{x} = \tilde{s}(i-\Delta)$, is given by

$$\tilde{x} = x - k^*r$$

so that problem (6.21) becomes equivalent to solving

$$\min_k \; \mathsf{E}\,|x - k^*r|^2 \qquad (6.35)$$

This is a standard linear least-mean-squares formulation, and its solution is given by (cf.
Thm. 3.1):

$$k_{\text{opt}}^* = R_{xr}R_r^{-1} \qquad (6.36)$$

where

$$R_{xr} = \mathsf{E}\,s(i-\Delta)r^* \quad \text{and} \quad R_r = \mathsf{E}\,rr^*$$

and the resulting m.m.s.e. is

$$\sigma_s^2 - R_{xr}R_r^{-1}R_{rx} \qquad (6.37)$$

Some algebra will show that expressions (6.36) and (6.37) lead to the same solutions
(6.29)-(6.30); see Prob. II.41.
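The agreement between the two formulations can be checked numerically on the setup of Ex. 6.1 ($C(z) = 1 + 0.5z^{-1}$, $L = 3$, $Q = 1$, $\Delta = 1$, unit variances). The sketch below assumes real data, i.i.d. symbols, and white noise, and it orders the extended vector with the $Q$ past symbols first and the $L$ observations last, so that the solution stacks $\{-b(1), \ldots, -b(Q)\}$ ahead of $\{f(0), \ldots, f(L-1)\}$; the selector matrices `e_delta` and `Sel` are hypothetical helper names.

```python
import numpy as np

c, L, Q, delta = np.array([1.0, 0.5]), 3, 1, 1   # setup of Ex. 6.1, unit variances
M = len(c)
H = np.zeros((L, L + M - 1))
for row in range(L):
    H[row, row:row + M] = c                      # banded channel matrix

# Selectors acting on s_i = col{s(i), ..., s(i-L-M+2)} (i.i.d., unit variance).
e_delta = np.eye(L + M - 1)[delta]                       # picks s(i - delta)
Sel = np.eye(L + M - 1)[delta + 1:delta + Q + 1]         # picks s(i-delta-1), ..., s(i-delta-Q)

# r = col{Sel s_i, H s_i + v_i}, so its covariance has the block form below.
Rr = np.block([[Sel @ Sel.T, Sel @ H.T],
               [H @ Sel.T, H @ H.T + np.eye(L)]])
Rxr = np.concatenate([e_delta @ Sel.T, e_delta @ H.T])   # E s(i - delta) r^*

k = np.linalg.solve(Rr, Rxr)                     # (6.36); Rr is symmetric here
mmse = 1.0 - Rxr @ k                             # (6.37) with sigma_s^2 = 1
print(np.round(k, 4), round(mmse, 4))
```

The first entry of `k` should match $-b(1)$ and the remaining entries should match $\{f(0), f(1), f(2)\}$ reported in Ex. 6.1, up to rounding.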


6.5 APPLICATION: ANTENNA BEAMFORMING 


Our third application is in the context of antenna beamforming. In this application, we 
desire to combine the measurements of an antenna array in order to maximize its gain 
along a particular direction. 

Consider the diagram of Fig. 6.4, which shows a linear array of sensors or antennas,
assumed uniformly spaced, with the separation between two adjacent elements denoted by
$d$. The antenna array is assumed to be far from a source radiating an electromagnetic wave
of the form

$$r(t) = s(t)e^{j\omega_c t}, \qquad j \triangleq \sqrt{-1}$$

where $\omega_c$ denotes the carrier frequency and $s(t)$ denotes the envelope, also called the baseband signal. Since the source is sufficiently distant from the antenna array, the arriving
wavefronts can be assumed to be planar at the array. The output of the array is obtained by
linearly combining the measurements at the antennas. These measurements are subjected
to noise, with the noise at antenna $n$ at time $t$ denoted by $v_n(t)$.

Let $\theta$ denote the direction of arrival of the wavefront relative to the plane of the antennas,
as indicated in the figure, and consider the two leftmost antennas, labelled 0 and 1. The
distance that separates the planar waves arriving at these two elements is equal to $d\cos\theta$.

FIGURE 6.4 A uniformly-spaced array of antennas.

Moreover, the interval of time $\Delta t$ that is needed for the wave to propagate from antenna 1
to antenna 0 is $d\cos\theta/c$, where $c$ is the speed of propagation or, equivalently,

$$\Delta t = \frac{2\pi d\cos\theta}{\omega_c\lambda}$$

where $\lambda$ is the wavelength of the wavefront; it is related to $c$ via

$$c = \frac{\omega_c\lambda}{2\pi}$$

Now if $s(t)e^{j\omega_c t}$ is the signal at antenna 0 at time $t$, then $s(t+\Delta t)e^{j\omega_c(t+\Delta t)}$ is the
signal at that same time instant at antenna 1. Therefore, if we assume a slowly-varying
envelope, i.e., if $s(t+\Delta t) \approx s(t)$, then the signal at antenna 1 at time $t$ is

$$s(t)e^{j\omega_c\left(t + \frac{2\pi d\cos\theta}{\omega_c\lambda}\right)} = s(t)e^{j\omega_c t}e^{j\frac{2\pi}{\lambda}d\cos\theta}$$

More generally, by following a similar argument, the signal arriving at the $n$-th antenna at
time $t$ is of the form

$$s(t)e^{j\omega_c t}e^{j\frac{2\pi n}{\lambda}d\cos\theta}, \qquad n = 0, 1, \ldots, M-1$$

If at time $t$ we take a snapshot of the values of the signals at the $M$ antenna elements we
obtain, from left to right,

$$s(t)e^{j\omega_c t}\begin{bmatrix} 1 & e^{j\frac{2\pi}{\lambda}d\cos\theta} & e^{j\frac{4\pi}{\lambda}d\cos\theta} & \ldots & e^{j\frac{2\pi(M-1)}{\lambda}d\cos\theta} \end{bmatrix}$$

Usually, before processing by the antenna array, the incident signals are first converted
to baseband, which means that the carrier component $e^{j\omega_c t}$ is removed. In this way, the
signals received by the antennas at time $t$ can be assumed to be of the form:

$$s(t)e^{j\frac{2\pi n}{\lambda}d\cos\theta}, \qquad n = 0, 1, \ldots, M-1$$


Now let $y$ denote a noisy snapshot of the baseband signals at the antennas, i.e.,

$$y = \underbrace{\begin{bmatrix} 1 \\ e^{j\frac{2\pi}{\lambda}d\cos\theta} \\ \vdots \\ e^{j\frac{2\pi(M-1)}{\lambda}d\cos\theta} \end{bmatrix}}_{h:\; M \times 1} s(t) + \underbrace{\begin{bmatrix} v_0(t) \\ v_1(t) \\ \vdots \\ v_{M-1}(t) \end{bmatrix}}_{v:\; M \times 1}$$

where $v$ is a column vector whose entries correspond to the noise components at the individual antennas. Moreover, the column vector $h$ is dependent on the direction of arrival $\theta$,
the wavelength $\lambda$, and the geometry of the array as defined by $(M, d)$. Observe that the
entries of $y$ correspond to measurements at different points in space rather than at different
points in time, i.e., we are now dealing with an application that involves spatial sampling
as opposed to time sampling.

The purpose of a beamformer is to combine the entries of the snapshot, say, as k*y for 
some column vector k, in order to achieve two objectives. 


1. Directionality. The vector $k$ must satisfy $k^*h = 1$. This condition guarantees that
when the incident direction of arrival is $\theta$, the response of the beamformer, in the
absence of any noise, will be

$$k^*y = s(t)k^*h = s(t)$$

In other words, an incident wave along the direction $\theta$ will pass through the beamformer undistorted.


2. Interference attenuation. When the snapshot $y$ is corrupted by additive noise
(including the effects of interferences caused by signals not originating from the
direction $\theta$), we would like the output of the beamformer to provide an estimate of
$s(t)$. For this purpose, we should choose $k$ in order to minimize the error variance,
i.e., by solving

$$\min_k \; \mathsf{E}\,|\tilde{s}(t)|^2$$

where

$$\tilde{s}(t) = s(t) - \hat{s}(t) = s(t) - k^*y = s(t) - k^*[hs(t) + v(t)] = -k^*v$$

since $k^*h = 1$. Therefore, the second requirement on $k$ is to minimize the variance
$k^*R_vk$, where $R_v$ is the covariance matrix of $v$.


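The beamforming problem, also known as the linearly-constrained minimum-variance
problem, or as the minimum-variance distortionless-response problem, is then to solve

$$\min_k \; k^*R_vk \quad \text{subject to} \quad k^*h = 1$$

Using the result in Remark 6.1 (with $K = k^*$ and $H = h$), we find that the optimal choice for $k$ is

$$k_o^* = \frac{h^*R_v^{-1}}{h^*R_v^{-1}h}$$

in which case the beamformer output is

$$\hat{s}(t) = \frac{h^*R_v^{-1}y}{h^*R_v^{-1}h}$$

and the resulting m.m.s.e. is

$$\text{m.m.s.e.} = \frac{1}{h^*R_v^{-1}h}$$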
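A short numerical sketch of this solution follows. The array size, the half-wavelength spacing, the look direction, and the covariance model (white noise plus one strong interferer from a different direction) are hypothetical choices made only for illustration; the function names `steering_vector` and `mvdr_weights` are likewise assumptions.

```python
import numpy as np

def steering_vector(M, d_over_lambda, theta):
    """h = col{exp(j 2 pi n (d / lambda) cos(theta))}, n = 0, ..., M-1."""
    n = np.arange(M)
    return np.exp(1j * 2 * np.pi * n * d_over_lambda * np.cos(theta))

def mvdr_weights(h, Rv):
    """k_o = Rv^{-1} h / (h^* Rv^{-1} h), so that k_o^* h = 1 (Rv Hermitian)."""
    Rinv_h = np.linalg.solve(Rv, h)
    return Rinv_h / (h.conj() @ Rinv_h)

M = 8
h = steering_vector(M, 0.5, np.pi / 3)           # hypothetical look direction
h_int = steering_vector(M, 0.5, np.pi / 4)       # hypothetical interferer direction
Rv = 0.1 * np.eye(M) + 10.0 * np.outer(h_int, h_int.conj())

k = mvdr_weights(h, Rv)
mmse = np.real(k.conj() @ Rv @ k)                # should equal 1 / (h^* Rv^{-1} h)

print(np.allclose(k.conj() @ h, 1.0))            # distortionless response along theta
print(abs(k.conj() @ h_int))                     # response along the interferer direction
print(np.allclose(mvdr_weights(h, 0.3 * np.eye(M)), h / M))
```

The last line checks the special case in which $R_v$ is a scaled identity: the weights then reduce to $h/M$, independently of the noise variance.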


If the noise at the antennas is spatially white, i.e., if the noise at each sensor is uncorrelated with the noises at the other sensors, and if the variance of all noises is $\sigma_v^2$, then
$R_v = \sigma_v^2 I$. In this case, the output of the beamformer becomes

$$\hat{s}(t) = \frac{h^*R_v^{-1}y}{h^*R_v^{-1}h} = \frac{h^*y}{\|h\|^2} = \frac{1}{M}\sum_{n=0}^{M-1} y_n(t)\,e^{-j\frac{2\pi n}{\lambda}d\cos\theta}$$

where we have denoted the individual entries of $y$ by $y_n(t)$. This expression shows that,
in the case of white spatial noise, the beamformer first aligns the phases of the signals it
receives and then averages them.

The theory developed in Chapters 3-5 on linear estimation can be used to introduce one
of the most celebrated tools in linear least-mean-squares estimation theory, namely the
Kalman filter. The filter has an intimate relation with adaptive filter theory, so much so
that a solid understanding of its functionality can suggest extensions of classical adaptive
schemes. A demonstration to this effect will be given later in Chapter 31, after we have
progressed sufficiently in our treatment of adaptive filters. At that stage, we shall
tie up the Kalman filter with adaptive least-squares theory and show how it can motivate
useful extensions. For the time being, it suffices to treat the material in this chapter as
simply an application of linear least-mean-squares estimation theory.


7.1 INNOVATIONS PROCESS


Consider two zero-mean random variables $\{x, y\}$. We already know from Thm. 3.1 that
the linear least-mean-squares estimator of $x$ given $y$ is $\hat{x} = K_oy$, where $K_o$ is any solution
to the normal equations

$$K_oR_y = R_{xy} \qquad (7.1)$$

In the sequel we assume that $R_y$ is positive-definite so that $K_o$ is uniquely defined as
$K_o = R_{xy}R_y^{-1}$.

Usually, the variable $y$ is vector-valued, say, $y = \text{col}\{y_0, y_1, \ldots, y_N\}$, where each $y_i$
is also possibly a vector. Now assume that we could somehow replace $y$ by another vector
$e$ of similar dimensions, say,

$$e = Ay \qquad (7.2)$$

for some lower triangular invertible matrix $A$. Assume further that the transformation
$A$ could be chosen such that the entries of $e$, denoted by $e = \text{col}\{e_0, e_1, \ldots, e_N\}$, are
uncorrelated with each other, i.e.,

$$\mathsf{E}\,e_ie_j^* \triangleq R_{e,i}\,\delta_{ij}$$

where $\delta_{ij}$ denotes the Kronecker delta function that is unity when $i = j$ and zero otherwise,
and $R_{e,i}$ denotes the covariance matrix of $e_i$. Then the covariance matrix of $e$ will be block
diagonal,

$$R_e \triangleq \mathsf{E}\,ee^* = \text{diag}\{R_{e,0}, R_{e,1}, \ldots, R_{e,N}\}$$


and, in addition, the problem of estimating $x$ from $y$ would be equivalent to the problem
of estimating $x$ from $e$. To see this, let $\hat{x}_{|e}$ denote the linear least-mean-squares estimator
of $x$ given $e$, i.e.,

$$\hat{x}_{|e} = R_{xe}R_e^{-1}e \qquad (7.3)$$

Likewise, let $\hat{x}_{|y}$ denote the estimator of $x$ given $y$,

$$\hat{x}_{|y} = R_{xy}R_y^{-1}y \qquad (7.4)$$

Then since

$$R_e = \mathsf{E}\,ee^* = A(\mathsf{E}\,yy^*)A^* = AR_yA^*$$

and

$$R_{xe} = \mathsf{E}\,xe^* = (\mathsf{E}\,xy^*)A^* = R_{xy}A^*$$

we find that

$$\hat{x}_{|e} = R_{xe}R_e^{-1}e = R_{xy}A^*(AR_yA^*)^{-1}e = R_{xy}R_y^{-1}A^{-1}e = R_{xy}R_y^{-1}y$$

That is,

$$\hat{x}_{|e} = \hat{x}_{|y} \qquad (7.5)$$

as claimed. This means that we can replace the problem of estimating $x$ from $y$ by the
problem of estimating $x$ from $e$. The key advantage of working with $e$ instead of $y$ is
that $R_e$ in (7.3) is block-diagonal and, hence, the estimator $\hat{x}_{|e}$ can be evaluated as the
combined sum of individual estimators. Specifically, expression (7.3) gives

$$\hat{x}_{|e} = \sum_{i=0}^{N} (\mathsf{E}\,xe_i^*)R_{e,i}^{-1}e_i = \sum_{i=0}^{N} \hat{x}_{|e_i}$$

This result shows that we can estimate $x$ from $y$ by estimating $x$ individually from each $e_i$
and then combining the resulting estimators. In particular, if we replace the notations $\hat{x}_{|e}$
and $\hat{x}_{|y}$ by the more suggestive notation $\hat{x}_{|N}$, in order to indicate that the estimator of $x$ is
based on the observations $y_0$ through $y_N$, then the above expression shows that

$$\hat{x}_{|N} = \sum_{i=0}^{N} \hat{x}_{|e_i} = \hat{x}_{|e_N} + \sum_{i=0}^{N-1} \hat{x}_{|e_i}$$

where the last sum on the right-hand side is simply the estimator of $x$ using the observations $y_0$ through $y_{N-1}$. It follows that

$$\hat{x}_{|N} = \hat{x}_{|N-1} + \hat{x}_{|e_N}$$

i.e.,

$$\hat{x}_{|N} = \hat{x}_{|N-1} + (\mathsf{E}\,xe_N^*)R_{e,N}^{-1}e_N \qquad (7.6)$$

This is a useful recursive formula; it shows how the estimator of $x$ can be updated recursively by adding the contribution of the most recent variable $e_N$.

The question now is how to generate the variables $\{e_i\}$ from the $\{y_i\}$. One possible
transformation is the so-called Gram-Schmidt procedure. Let $\hat{y}_{i|i-1}$ denote the estimator of $y_i$ that is based on the observations up to time $i-1$, i.e., on $\{y_0, y_1, \ldots, y_{i-1}\}$.
The same argument that led to (7.5) shows that $\hat{y}_{i|i-1}$ can be alternatively calculated by
estimating $y_i$ from $\{e_0, \ldots, e_{i-1}\}$. Then we can construct $e_i$ as

$$e_i \triangleq y_i - \hat{y}_{i|i-1} \qquad (7.7)$$


That is, we can choose $e_i$ as the estimation error that results from estimating $y_i$ from the
observations $\{y_0, y_1, \ldots, y_{i-1}\}$. In order to verify that the resulting $\{e_i\}$ are uncorrelated
with each other, we recall that, by virtue of the orthogonality condition of linear least-mean-squares estimation (cf. Thm. 4.1),

$$e_i \perp \{y_0, y_1, \ldots, y_{i-1}\}$$

That is, $e_i$ is uncorrelated with the observations $\{y_0, y_1, \ldots, y_{i-1}\}$. It then follows that $e_i$
should be uncorrelated with any $e_j$ for $j < i$ since, by definition, $e_j$ is a linear combination
of the observations $\{y_0, y_1, \ldots, y_j\}$ and, moreover,

$$\{y_0, y_1, \ldots, y_j\} \subset \{y_0, y_1, \ldots, y_{i-1}\} \quad \text{for } j < i$$

By the same token, $e_i$ is uncorrelated with any $e_j$ for $j > i$.

It is instructive to see what choice of a transformation $A$ in (7.2) corresponds to the use
of the Gram-Schmidt procedure. Assume, for illustration purposes, that $N = 2$. Then
writing (7.7) for $i = 0, 1, 2$ we get

$$\begin{bmatrix} e_0 \\ e_1 \\ e_2 \end{bmatrix} = \begin{bmatrix} I & & \\ -(E\,y_1 y_0^*)(E\,y_0 y_0^*)^{-1} & I & \\ \times & \times & I \end{bmatrix} \begin{bmatrix} y_0 \\ y_1 \\ y_2 \end{bmatrix}$$

where the entries $\times$ arise from the calculation

$$\begin{bmatrix} \times & \times \end{bmatrix} = -\left(E\,y_2 \begin{bmatrix} y_0^* & y_1^* \end{bmatrix}\right)\left(E \begin{bmatrix} y_0 \\ y_1 \end{bmatrix}\begin{bmatrix} y_0^* & y_1^* \end{bmatrix}\right)^{-1}$$

We thus find that $A$ is a lower triangular transformation with unit entries along its diagonal.
The lower triangularity of $A$ is relevant since it translates into a causal relationship
between the $\{e_i\}$ and the $\{y_i\}$. By causality we mean that each $e_i$ can be computed from
$\{y_j, j \le i\}$ and, similarly, each $y_i$ can be recovered from $\{e_j, j \le i\}$. We also see
from the construction (7.7) that we can regard $e_i$ as the "new information" in $y_i$ given
$\{y_0, \ldots, y_{i-1}\}$. Therefore, it is customary to refer to the $\{e_i\}$ as the innovations process
associated with the $\{y_i\}$.
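The Gram-Schmidt construction above can be sketched numerically. The following is a minimal illustration, assuming real scalar observations and a made-up covariance matrix `R_y` (the function names and the numbers are hypothetical, not from the text). It builds the rows of $A$ one at a time by removing from each $y_i$ its projection onto the earlier innovations, and then forms the covariance of $e = Ay$ so that one can inspect that it is diagonal.

```python
from fractions import Fraction

def innovations_transform(R):
    """Return (A, d) for real scalar observations with covariance matrix R:
    A is unit lower triangular and e = A y has covariance diag(d)."""
    n = len(R)
    A = []   # row i holds the coefficients of e_i in terms of y_0..y_{n-1}
    d = []   # d[i] = E e_i^2
    for i in range(n):
        row = [Fraction(0)] * n
        row[i] = Fraction(1)                      # start from e_i = y_i
        for j in range(i):
            # E y_i e_j, using e_j = sum_k A[j][k] y_k
            cross = sum(A[j][k] * R[i][k] for k in range(n))
            coeff = cross / d[j]
            for k in range(n):                    # subtract the projection onto e_j
                row[k] -= coeff * A[j][k]
        var = sum(row[k] * R[k][l] * row[l] for k in range(n) for l in range(n))
        A.append(row)
        d.append(var)
    return A, d

def covariance_of_e(A, R):
    """Covariance matrix A R A^T of the transformed variables e = A y."""
    n = len(R)
    return [[sum(A[i][k] * R[k][l] * A[j][l] for k in range(n) for l in range(n))
             for j in range(n)] for i in range(n)]

# Hypothetical positive-definite covariance of (y_0, y_1, y_2)
R_y = [[Fraction(4), Fraction(2), Fraction(1)],
       [Fraction(2), Fraction(3), Fraction(1)],
       [Fraction(1), Fraction(1), Fraction(2)]]

A, d = innovations_transform(R_y)
R_e = covariance_of_e(A, R_y)
for row in A:
    print([str(a) for a in row])
print("innovation variances:", [str(v) for v in d])
```

Exact rational arithmetic is used only so that the zero off-diagonal entries of `R_e` are not blurred by rounding; floating-point numbers would work equally well for a rough check.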


7.2 STATE-SPACE MODEL 


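As we now proceed to show, the Kalman filter is an efficient procedure for determining
the innovations when the observation process $\{y_i\}$ arises from a finite-dimensional linear
state-space model.

What we mean by a state-space model for $\{y_i\}$ is the following. We assume that $y_i$
satisfies an equation of the form

$$y_i = H_i x_i + v_i, \quad i \ge 0 \qquad (7.8)$$

in terms of an $n \times 1$ so-called state-vector $x_i$, which in turn obeys a recursion of the form

$$x_{i+1} = F_i x_i + G_i n_i, \quad i \ge 0 \qquad (7.9)$$

The processes $v_i$ and $n_i$ are assumed to be $p \times 1$ and $m \times 1$ zero-mean white noise
processes, respectively, with covariances and cross-covariances denoted by

$$E \begin{bmatrix} n_i \\ v_i \end{bmatrix}\begin{bmatrix} n_j \\ v_j \end{bmatrix}^* \triangleq \begin{bmatrix} Q_i & S_i \\ S_i^* & R_i \end{bmatrix}\delta_{ij}$$

(with $\delta_{ij}$ denoting the Kronecker delta), whereas the initial state $x_0$ is assumed to have
zero mean, covariance matrix $\Pi_0$, and to be uncorrelated with $\{n_i\}$ and $\{v_i\}$, i.e.,

$$E\,x_0 x_0^* = \Pi_0, \quad E\,n_i x_0^* = 0, \quad \text{and} \quad E\,v_i x_0^* = 0 \quad \text{for all } i \ge 0$$

The assumptions on $\{x_0, n_i, v_i\}$ can be compactly restated as

$$E \begin{bmatrix} n_i \\ v_i \\ x_0 \end{bmatrix}\begin{bmatrix} n_j \\ v_j \\ x_0 \\ 1 \end{bmatrix}^* = \begin{bmatrix} \begin{bmatrix} Q_i & S_i \\ S_i^* & R_i \end{bmatrix}\delta_{ij} & 0 & 0 \\ 0 & \Pi_0 & 0 \end{bmatrix} \qquad (7.10)$$

It is also assumed that the matrices

$F_i$ ($n \times n$), $G_i$ ($n \times m$), $H_i$ ($p \times n$), $Q_i$ ($m \times m$), $R_i$ ($p \times p$), $S_i$ ($m \times p$), $\Pi_0$ ($n \times n$)

are known a priori. The process $v_i$ is called measurement noise and the process $n_i$ is called
process noise. We now examine how the innovations $\{e_i\}$ of a process $\{y_i\}$ satisfying a
state-space model of the form (7.8)-(7.10) can be evaluated.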


7.3 RECURSION FOR THE STATE ESTIMATOR 


Let $\{\hat{y}_{i|i-1}, \hat{x}_{i|i-1}, \hat{v}_{i|i-1}\}$ denote the estimators of the variables $\{y_i, x_i, v_i\}$ from the
observations $\{y_0, y_1, \ldots, y_{i-1}\}$, respectively. Then using $y_i = H_i x_i + v_i$, and appealing
to linearity, we have

$$\hat{y}_{i|i-1} = H_i \hat{x}_{i|i-1} + \hat{v}_{i|i-1} \qquad (7.11)$$

Now the assumptions on our state-space model imply that

$$v_i \perp y_j \quad \text{for } j \le i-1$$

i.e., $v_i$ is uncorrelated with the observations $\{y_j, j \le i-1\}$, so that

$$\hat{v}_{i|i-1} = 0$$

This is because from the model (7.8)-(7.9), $y_j$ is a linear combination of the variables
$\{v_j, n_{j-1}, \ldots, n_0, x_0\}$, all of which are uncorrelated with $v_i$ for $j \le i-1$. Consequently,

$$e_i = y_i - \hat{y}_{i|i-1} = y_i - H_i \hat{x}_{i|i-1} \qquad (7.12)$$

Therefore, the problem of finding the innovations reduces to one of finding $\hat{x}_{i|i-1}$. For this
purpose, we can use (7.6) to write

$$\begin{aligned} \hat{x}_{i+1|i} &= \hat{x}_{i+1|i-1} + (E\,x_{i+1} e_i^*)\,R_{e,i}^{-1}\,e_i \\ &= \hat{x}_{i+1|i-1} + (E\,x_{i+1} e_i^*)\,R_{e,i}^{-1}\,(y_i - H_i \hat{x}_{i|i-1}) \end{aligned} \qquad (7.13)$$

where

$$R_{e,i} \triangleq E\,e_i e_i^* \qquad (7.14)$$

But since $x_{i+1}$ obeys the state equation $x_{i+1} = F_i x_i + G_i n_i$, we also obtain, again by
linearity, that

$$\hat{x}_{i+1|i-1} = F_i \hat{x}_{i|i-1} + G_i \hat{n}_{i|i-1} = F_i \hat{x}_{i|i-1} + 0 \qquad (7.15)$$

since $n_i \perp y_j$, $j \le i-1$. By combining Eqs. (7.12)-(7.15) we arrive at the following
recursive equations for determining the innovations:

$$\begin{aligned} e_i &= y_i - H_i \hat{x}_{i|i-1} \\ \hat{x}_{i+1|i} &= F_i \hat{x}_{i|i-1} + K_{p,i}\,e_i, \quad i \ge 0 \end{aligned} \qquad (7.16)$$

with initial conditions

$$\hat{x}_{0|-1} = 0, \quad e_0 = y_0 \qquad (7.17)$$

and where we have defined the gain matrix

$$K_{p,i} \triangleq (E\,x_{i+1} e_i^*)\,R_{e,i}^{-1} \qquad (7.18)$$

The subscript "p" indicates that $K_{p,i}$ is used to update a predicted estimator of the state
vector. By combining the equations in (7.16) we also find that

$$\hat{x}_{i+1|i} = F_{p,i}\,\hat{x}_{i|i-1} + K_{p,i}\,y_i, \quad F_{p,i} \triangleq F_i - K_{p,i} H_i, \quad \hat{x}_{0|-1} = 0, \quad i \ge 0 \qquad (7.19)$$

which shows that in finding the innovations, we actually also have a complete recursion
for the state-estimator $\{\hat{x}_{i|i-1}\}$.


7.4 COMPUTING THE GAIN MATRIX 


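We still need to evaluate $K_{p,i}$ and $R_{e,i}$. To do so, we introduce the state-estimation error
$\tilde{x}_{i|i-1} = x_i - \hat{x}_{i|i-1}$, and denote its covariance matrix by

$$P_{i|i-1} \triangleq E\,\tilde{x}_{i|i-1}\tilde{x}_{i|i-1}^* \qquad (7.20)$$

Then, as we are going to see, the $\{K_{p,i}, R_{e,i}\}$ can be expressed in terms of $P_{i|i-1}$ and, in
addition, the evaluation of $P_{i|i-1}$ will require propagating a so-called Riccati recursion.
To see this, note first that

$$e_i = y_i - H_i \hat{x}_{i|i-1} = H_i x_i - H_i \hat{x}_{i|i-1} + v_i = H_i \tilde{x}_{i|i-1} + v_i \qquad (7.21)$$

Moreover, $v_i \perp \tilde{x}_{i|i-1}$. This is because $\tilde{x}_{i|i-1}$ is a linear combination of the variables
$\{v_0, \ldots, v_{i-1}, x_0, n_0, \ldots, n_{i-1}\}$, all of which are uncorrelated with $v_i$. This claim
follows from the definition $\tilde{x}_{i|i-1} = x_i - \hat{x}_{i|i-1}$ and from the fact that $\hat{x}_{i|i-1}$ is a linear
combination of $\{y_0, \ldots, y_{i-1}\}$ and $x_i$ is a linear combination of $\{x_0, n_0, \ldots, n_{i-1}\}$.
Therefore, we get

$$R_{e,i} = E\,e_i e_i^* = R_i + H_i P_{i|i-1} H_i^* \qquad (7.22)$$

Likewise, since

$$E\,x_{i+1} e_i^* = F_i\,(E\,x_i e_i^*) + G_i\,(E\,n_i e_i^*) \qquad (7.23)$$

with the terms $E\,x_i e_i^*$ and $E\,n_i e_i^*$ given by

$$\begin{aligned} E\,x_i e_i^* &= E\,(\hat{x}_{i|i-1} + \tilde{x}_{i|i-1})\,e_i^* \\ &= E\,\tilde{x}_{i|i-1} e_i^*, \quad \text{since } e_i \perp \hat{x}_{i|i-1} \\ &= E\,\tilde{x}_{i|i-1}(H_i \tilde{x}_{i|i-1} + v_i)^* \\ &= E\,\tilde{x}_{i|i-1}(H_i \tilde{x}_{i|i-1})^* + 0, \quad \text{since } v_i \perp \tilde{x}_{i|i-1} \\ &= P_{i|i-1} H_i^* \end{aligned}$$

and

$$\begin{aligned} E\,n_i e_i^* &= E\,n_i (H_i \tilde{x}_{i|i-1} + v_i)^* \\ &= 0 + E\,n_i v_i^*, \quad \text{since } n_i \perp \tilde{x}_{i|i-1} \\ &= S_i \end{aligned}$$

we get

$$K_{p,i} = (E\,x_{i+1} e_i^*)\,R_{e,i}^{-1} = (F_i P_{i|i-1} H_i^* + G_i S_i)\,R_{e,i}^{-1} \qquad (7.24)$$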
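As a quick numerical illustration of the expressions for the innovations covariance and the gain, consider a scalar case. The values here are assumed purely for illustration and do not come from the text: $n = p = m = 1$ with $F_i = G_i = H_i = 1$, $R_i = 1$, $S_i = 1/2$, and $P_{i|i-1} = 1$. Then

$$\begin{aligned} R_{e,i} &= R_i + H_i P_{i|i-1} H_i^* = 1 + 1 = 2 \\ K_{p,i} &= (F_i P_{i|i-1} H_i^* + G_i S_i)\,R_{e,i}^{-1} = \frac{1 + 1/2}{2} = \frac{3}{4} \end{aligned}$$

so that, in this example, the innovation has variance 2 and the predicted state estimator is corrected by three quarters of each new innovation.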


7.5 RICCATI RECURSION 


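Since $n_i \perp x_i$, it can be easily seen from $x_{i+1} = F_i x_i + G_i n_i$ that the covariance matrix
of $x_i$ obeys the recursion

$$\Pi_{i+1} = F_i \Pi_i F_i^* + G_i Q_i G_i^*, \quad \Pi_i \triangleq E\,x_i x_i^* \qquad (7.25)$$

Likewise, since $e_i \perp \hat{x}_{i|i-1}$, it can be seen from $\hat{x}_{i+1|i} = F_i \hat{x}_{i|i-1} + K_{p,i} e_i$ that the
covariance matrix of $\hat{x}_{i|i-1}$ satisfies the recursion

$$\Sigma_{i+1} = F_i \Sigma_i F_i^* + K_{p,i} R_{e,i} K_{p,i}^*, \quad \Sigma_i \triangleq E\,\hat{x}_{i|i-1}\hat{x}_{i|i-1}^* \qquad (7.26)$$

with initial condition $\Sigma_0 = 0$. Now the orthogonal decomposition

$$x_i = \hat{x}_{i|i-1} + \tilde{x}_{i|i-1} \quad \text{with} \quad \hat{x}_{i|i-1} \perp \tilde{x}_{i|i-1}$$

shows that $\Pi_i = \Sigma_i + P_{i|i-1}$. It is then immediate to conclude that the matrix
$P_{i+1|i} = \Pi_{i+1} - \Sigma_{i+1}$ satisfies the recursion

$$P_{i+1|i} = F_i P_{i|i-1} F_i^* + G_i Q_i G_i^* - K_{p,i} R_{e,i} K_{p,i}^*, \quad P_{0|-1} = \Pi_0 \qquad (7.27)$$

which is known as the Riccati recursion.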
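To get a feel for how the Riccati recursion behaves, the sketch below iterates it for a real scalar, time-invariant model. All parameter values are assumptions chosen for illustration ($F = G = H = Q = R = 1$, $S = 0$, $\Pi_0 = 1$); for this particular choice any fixed point must satisfy $P^2 = 1 + P$, so the iterates can be compared against the positive root $(1+\sqrt{5})/2$.

```python
import math

def riccati_step(P, F=1.0, G=1.0, H=1.0, Q=1.0, R=1.0, S=0.0):
    """One step of the Riccati recursion for real scalars: maps P_{i|i-1} to P_{i+1|i}."""
    Re = R + H * P * H                 # innovations variance R_{e,i}
    Kp = (F * P * H + G * S) / Re      # gain K_{p,i}
    return F * P * F + G * Q * G - Kp * Re * Kp

P = 1.0                                # P_{0|-1} = Pi_0 (assumed value)
history = [P]
for _ in range(50):
    P = riccati_step(P)
    history.append(P)

golden = (1 + math.sqrt(5)) / 2        # positive root of P^2 = 1 + P
print(history[:4], history[-1], golden)
```

In this example the iterates settle quickly; other parameter choices (in particular time-varying ones) need not converge at all.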


7.6 COVARIANCE FORM 


In summary, we arrive at the following statement of the Kalman filter, also known as 
the covariance form of the filter. 


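Algorithm 7.1 (The Kalman filter) Given observations $\{y_i\}$ that satisfy the
state-space model (7.8)-(7.10), the innovations process $\{e_i\}$ can be recursively
computed as follows. Start with $\hat{x}_{0|-1} = 0$, $P_{0|-1} = \Pi_0$, and repeat
for $i \ge 0$:

$$\begin{aligned} R_{e,i} &= R_i + H_i P_{i|i-1} H_i^* \\ K_{p,i} &= (F_i P_{i|i-1} H_i^* + G_i S_i)\,R_{e,i}^{-1} \\ e_i &= y_i - H_i \hat{x}_{i|i-1} \\ \hat{x}_{i+1|i} &= F_i \hat{x}_{i|i-1} + K_{p,i}\,e_i \\ P_{i+1|i} &= F_i P_{i|i-1} F_i^* + G_i Q_i G_i^* - K_{p,i} R_{e,i} K_{p,i}^* \end{aligned}$$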
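The steps of the covariance form can be sketched in a few lines of code. The version below is only an illustration under simplifying assumptions: all quantities are real scalars and the model is time-invariant, so conjugate transposes reduce to ordinary products. The function name, the observation values, and the parameter values are hypothetical.

```python
def kalman_prediction_form(ys, F, G, H, Q, R, S, Pi0):
    """Real scalar, time-invariant sketch of the covariance (prediction) form.
    Returns the innovations e_i, their variances R_{e,i}, and the predictions x_{i|i-1}."""
    x_pred, P = 0.0, Pi0                     # x_{0|-1} = 0, P_{0|-1} = Pi_0
    innovations, variances, predictions = [], [], []
    for y in ys:
        Re = R + H * P * H                   # innovations variance
        Kp = (F * P * H + G * S) / Re        # prediction gain
        e = y - H * x_pred                   # innovation
        predictions.append(x_pred)
        innovations.append(e)
        variances.append(Re)
        x_pred = F * x_pred + Kp * e         # state-estimator update
        P = F * P * F + G * Q * G - Kp * Re * Kp   # Riccati update
    return innovations, variances, predictions

ys = [1.0, 2.0, 3.0]                         # made-up observations
e, Re, xp = kalman_prediction_form(ys, F=1.0, G=1.0, H=1.0, Q=1.0, R=1.0, S=0.0, Pi0=1.0)
print(e, Re, xp)
```

Note that the gain and covariance sequences do not depend on the data `ys`; only the innovations and the state estimators do, which is why the covariance quantities can in principle be computed ahead of time.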


7.7 MEASUREMENT AND TIME-UPDATE FORM 


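The implementation described in Alg. 7.1 is known as the prediction form of the Kalman
filter since it relies on propagating the one-step prediction $\{\hat{x}_{i|i-1}\}$. There is an alternative
implementation of the Kalman filter, known as the time- and measurement-update form. It
relies on going from $\hat{x}_{i|i-1}$ to $\hat{x}_{i|i}$ (a measurement-update step), and on going from $\hat{x}_{i|i}$ to
$\hat{x}_{i+1|i}$ (a time-update step).

For the measurement-update step, it can be verified by arguments similar to the ones
used in deriving the prediction form, that

$$\hat{x}_{i|i} = \hat{x}_{i|i-1} + K_{f,i}\,e_i, \quad K_{f,i} \triangleq P_{i|i-1} H_i^* R_{e,i}^{-1} \qquad (7.28)$$

with error covariance matrix

$$P_{i|i} \triangleq E\,\tilde{x}_{i|i}\tilde{x}_{i|i}^* = P_{i|i-1} - P_{i|i-1} H_i^* R_{e,i}^{-1} H_i P_{i|i-1} \qquad (7.29)$$

Likewise, for the time-update step we get

$$\hat{x}_{i+1|i} = F_i \hat{x}_{i|i} + G_i S_i R_{e,i}^{-1} e_i \qquad (7.30)$$

with

$$P_{i+1|i} = F_i P_{i|i} F_i^* + G_i (Q_i - S_i R_{e,i}^{-1} S_i^*) G_i^* - F_i K_{f,i} S_i^* G_i^* - G_i S_i K_{f,i}^* F_i^* \qquad (7.31)$$

When $S_i = 0$, this latter recursion simplifies to $P_{i+1|i} = F_i P_{i|i} F_i^* + G_i Q_i G_i^*$.

Algorithm 7.2 (Time- and measurement-update forms) Given observations
$\{y_i\}$ that satisfy the state-space model (7.8), (7.9), and (7.10), the innovations
process $\{e_i\}$ can be recursively computed as follows. Start with
$\hat{x}_{0|-1} = 0$, $P_{0|-1} = \Pi_0$, and repeat for $i \ge 0$:

$$\begin{aligned} R_{e,i} &= R_i + H_i P_{i|i-1} H_i^* \\ K_{f,i} &= P_{i|i-1} H_i^* R_{e,i}^{-1} \\ e_i &= y_i - H_i \hat{x}_{i|i-1} \\ \hat{x}_{i|i} &= \hat{x}_{i|i-1} + K_{f,i}\,e_i \\ \hat{x}_{i+1|i} &= F_i \hat{x}_{i|i} + G_i S_i R_{e,i}^{-1} e_i \\ P_{i|i} &= P_{i|i-1} - P_{i|i-1} H_i^* R_{e,i}^{-1} H_i P_{i|i-1} \\ P_{i+1|i} &= F_i P_{i|i} F_i^* + G_i (Q_i - S_i R_{e,i}^{-1} S_i^*) G_i^* - F_i K_{f,i} S_i^* G_i^* - G_i S_i K_{f,i}^* F_i^* \end{aligned}$$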
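One way to build confidence in the time- and measurement-update form is to check numerically that it reproduces the one-step predictions and error covariances of the prediction form. The sketch below does this for real scalars with a nonzero cross-covariance $S$; the function names and all numerical values are assumptions made for illustration only.

```python
def prediction_step(x_pred, P, y, F, G, H, Q, R, S):
    """One iteration of the prediction form, real scalar case."""
    Re = R + H * P * H
    Kp = (F * P * H + G * S) / Re
    e = y - H * x_pred
    return F * x_pred + Kp * e, F * P * F + G * Q * G - Kp * Re * Kp

def two_step_update(x_pred, P, y, F, G, H, Q, R, S):
    """Measurement update followed by time update, real scalar case."""
    Re = R + H * P * H
    Kf = P * H / Re
    e = y - H * x_pred
    x_filt = x_pred + Kf * e                       # measurement update of the estimator
    P_filt = P - P * H * H * P / Re                # measurement update of the covariance
    x_next = F * x_filt + G * S * e / Re           # time update of the estimator
    P_next = (F * P_filt * F + G * (Q - S * S / Re) * G
              - 2 * F * Kf * S * G)                # time update; cross terms coincide for real scalars
    return x_next, P_next

params = dict(F=0.9, G=1.0, H=2.0, Q=1.0, R=1.0, S=0.5)   # hypothetical model
xa = xb = 0.0
Pa = Pb = 1.0
max_gap = 0.0
for y in [1.2, -0.7, 0.4, 2.1]:                    # made-up observations
    xa, Pa = prediction_step(xa, Pa, y, **params)
    xb, Pb = two_step_update(xb, Pb, y, **params)
    max_gap = max(max_gap, abs(xa - xb), abs(Pa - Pb))
print("largest discrepancy:", max_gap)
```

Any discrepancy printed here should be at the level of floating-point rounding, since the two forms are algebraically equivalent.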


Summary and Notes 


The chapters in Part II introduce the basic principles underlying (constrained and unconstrained)
linear least-mean-squares estimation with several design examples. The most relevant results are 
summarized below. 


SUMMARY OF MAIN RESULTS 


1. Given two zero-mean random variables $\{x, y\}$, the optimal linear estimator (optimal in the
   least-mean-squares sense) of $x$ given $y$ is $\hat{x} = K_o y$, where $K_o$ is any solution to the normal
   equations $K_o R_y = R_{xy}$. This construction minimizes the error covariance matrix (or,
   equivalently, its trace), and the resulting minimum mean-square error matrix is
   m.m.s.e. $= R_x - K_o R_{yx}$.

2. The normal equations are always consistent, i.e., they always admit a solution $K_o$. The solution
   is unique only when $R_y$ is positive-definite; otherwise there are infinitely many solutions.

3. No matter which solution we pick for $K_o$ (when many solutions exist), the resulting estimator
   and minimum mean-square error values remain invariant.

4. The optimal linear estimator satisfies the orthogonality condition $\tilde{x} \perp y$. That is, the error is
   orthogonal to the observations, and to any linear transformation of the observations for that
   matter. In particular, $\tilde{x} \perp \hat{x}$.

5. If the variables $\{x, y\}$ do not have zero means, we can replace them by their centered versions,
   $\{x - \bar{x}, y - \bar{y}\}$, and then find $\hat{x}$ from $\hat{x} - \bar{x} = K_o (y - \bar{y})$.

6. When $\{x, y\}$ are related by a linear model, say $y = Hx + v$, then the optimal linear estimator
   of $x$ given $y$ can be determined from either expression

   $$\hat{x} = R_x H^* [R_v + H R_x H^*]^{-1} y = [R_x^{-1} + H^* R_v^{-1} H]^{-1} H^* R_v^{-1} y$$

   where it is assumed that $R_x > 0$ and $R_v > 0$. The resulting minimum mean-square error is
   m.m.s.e. $= [R_x^{-1} + H^* R_v^{-1} H]^{-1}$.

7. Given $y = Hx + v$, where $H$ is a tall full rank matrix and $x$ is an unknown deterministic
   quantity that we wish to estimate, the minimum variance unbiased estimator of $x$ is
   given by $\hat{x} = (H^* R_v^{-1} H)^{-1} H^* R_v^{-1} y$. The corresponding minimum mean-square error is
   m.m.s.e. $= (H^* R_v^{-1} H)^{-1}$.

8. The solution to a generic constrained optimization problem of the form

   $$\min_K\; K R_v K^* \quad \text{subject to} \quad KH = I \ \text{and} \ R_v > 0$$

   is given by $K_o = (H^* R_v^{-1} H)^{-1} H^* R_v^{-1}$, with the minimum cost equal to $(H^* R_v^{-1} H)^{-1}$.

9. Such constrained estimation problems are useful in the design of decision-feedback equalizers
   and antenna beamformers. The expressions for the feedforward and feedback filters in a
   decision-feedback equalizer are given in Sec. 6.4, while the expression for the gain vector of
   an antenna beamformer is given in Sec. 6.5.
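The two equivalent expressions for the estimator under the linear model $y = Hx + v$ can be checked on a scalar example. The sketch below uses exact rational arithmetic and assumed values for $R_x$, $R_v$, and $H$ (they are illustrative only); it evaluates both forms of the gain that multiplies $y$, and compares the stated m.m.s.e. with $R_x - K_o R_{yx}$.

```python
from fractions import Fraction

# Hypothetical real scalar model y = H x + v
Rx, Rv, H = Fraction(2), Fraction(1, 2), Fraction(3)

# First form of the gain: R_x H^* [R_v + H R_x H^*]^{-1}
gain_a = Rx * H / (Rv + H * Rx * H)

# Second form of the gain: [R_x^{-1} + H^* R_v^{-1} H]^{-1} H^* R_v^{-1}
gain_b = (1 / Rx + H * H / Rv) ** -1 * H / Rv

# m.m.s.e. from the closed form and from R_x - K_o R_yx, with R_yx = H R_x
mmse_closed = 1 / (1 / Rx + H * H / Rv)
mmse_normal = Rx - gain_a * H * Rx

print(gain_a, gain_b, mmse_closed, mmse_normal)
```

The agreement of the two gains is an instance of the matrix inversion formula; in the matrix case the second form is often preferable when the state dimension is smaller than the number of observations.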
BIBLIOGRAPHIC NOTES


Linear estimation. In this part we covered the basics of linear estimation theory, and highlighted 
those concepts that are most relevant to the subject matter of the book. However, it should be noted 
that linear least-mean-squares-error estimation is a rich field and it has had a distinguished history. 
The pioneering work in the area was done independently by Kolmogorov (1939,1941a,b) and Wiener 
(1942,1949). Kolmogorov was motivated by the work of Wold (1938) on stationary processes and 
solved a so-called linear prediction problem for discrete-time stationary random processes (see, e.g., 
Probs. II.31 and II.32 for some discussion on prediction problems). Wiener, on the other hand, solved 
a continuous-time prediction problem under causality constraints by means of an elegant technique 
now known as the Wiener-Hopf technique (Wiener and Hopf (1931)). Appendix 3.C of Sayed (2003) 
describes one particular causal estimation problem that exemplifies some of the elegance involved in 
the solution of Wiener and Hopf; the appendix focuses on a discrete finite-time horizon estimation 
problem, whereas Wiener and Hopf (1931) studied the more demanding continuous-time infinite 
horizon case. Unfortunately, in the literature of adaptive filtering, there seems to exist a persistent 
confusion with regards to Wiener's contribution. Most authors often mistakenly refer to the normal 
equations (3.9) as Wiener's solution. As is shown in App. 3.C of Sayed (2003), Wiener solved a 
deeper and more elaborate problem. Readers interested in more details about Wiener's contribution 
may consult the textbook by Kailath, Sayed, and Hassibi (2000), which offers a detailed treatment of 
the subject. 


Wiener and Kalman filters. The studies of Kolmogorov and Wiener laid the foundations for most 
of the subsequent developments in linear estimation theory. In particular, their investigations were 
fundamental to the development of the Kalman filter two decades later in elegant articles by Kalman 
(1960) and Kalman and Bucy (1961). While the works of Wiener and Kolmogorov focused on sta- 
tionary random processes, Kalman's filter had the powerful feature of being the optimal filter for 
both stationary and nonstationary processes. For this reason, the Kalman filter is considered in many 
respects to be one of the best successes of linear estimation theory. The filter also plays a prominent 
role in the context of adaptive filtering, as we shall indicate later in Parts VII-X on least-squares the- 
ory. In preparation for these discussions, we provided a derivation of the Kalman filter in Chapter 7 
for discrete-time models by using the innovations approach of Kailath (1968); this approach exploits 
to great advantage the orthogonality condition of linear least-mean-squares estimation. As is shown 
in Chapter 7, rather than determine the optimal steady-state filter for stationary processes, as was 
the case in Wiener's work, Kalman devised a recursive algorithm that is optimal during all stages of 
adaptation and which, in steady-state, tends to Wiener's solution. In Kalman's filter, recursion (7.27) 
plays a prominent role in propagating the error covariance matrix, and we shall encounter a special 
case of it later in Chapter 30 when we study the recursive least-squares algorithm (RLS). Kalman 
termed this recursion the Riccati recursion in analogy to a differential equation attributed to Riccati 
(1724), and later used by Legendre in the calculus of variations and by Bellman (1957) in optimal 
control theory. 


Stochastic modeling. What is particularly significant about the works of Kolmogorov, Wiener, 
and Kalman is the insight that the problem of separating signals from noise, as well as prediction 
and filtering problems, can be approached statistically. In other words, they can be approached by 
formulating statistical performance measures and by modelling the variables involved as random 
rather than deterministic quantities (as was the case with our formulation of the mean-square-error 
criterion in this chapter). This point of view is in clear contrast, for example, to deterministic (least- 
squares-based) estimation techniques studied earlier by Gauss (1809) and Legendre (1805,1810), 
and which we shall study in some detail starting in Chapter 29. [An overview of the historical 
progress from Gauss' work to Kalman's work can be found in Sorenson (1970) and in the edited 
volume by Sorenson (1985) — see also Kailath (1974).] The insight of formulating estimation 
problems stochastically in this manner has had a significant impact on the fields of signal processing, 
communications, and control. In particular, it has led to the development of several disciplines that 
nowadays go by the names of statistical signal processing, statistical communications theory, optimal 
stochastic control, and stochastic system theory, as can be seen, for instance, by examining the titles
of some of the references in these fields (e.g., Lee (1960), Åström (1970), Caines (1988), Scharf
(1991), and Kay (1993)).


Linear models. In this part we did not treat the linear least-mean-squares estimation problem in its 
generality; we only focused on the part of the material that is relevant to the development of adaptive 
filters in subsequent chapters. In particular, we emphasized the role played by the orthogonality 
principle (Sec. 4.2), as well as the simplifications that result when the underlying variables are related 
via a linear model (Sec. 5.1). Such linear models will play a crucial role throughout our development 
of adaptive filtering in this book. For instance, in Sec. 5.4 we already showed how the linear model 
can be exploited to derive compact closed-form expressions for the design of finite-length linear 
equalizers. 


Matrix inversion formula. The matrix inversion formula (5.4) is often attributed to Woodbury 
(1950). One of the first uses of the formula in the context of filtering theory is due to Kailath (1960a) 
and Ho (1963) — see the article by Henderson and Searle (1981) for an account on the origin of the 
formula. 


Space-time coding. Linear least-mean-squares estimation is useful in many applications. To 
illustrate this fact, in Probs. II.24 and II.38, we show its relevance to a transmit diversity signaling
technique devised by Alamouti (1998). The scheme exploits spatial diversity and results in a simple 
receiver structure. Due to its many attractive features, this signaling technique has been adopted 
in several wireless standards for code division multiple access (CDMA) communications such as 
WCDMA and CDMA2000. A brief overview of CDMA systems is provided in Computer Project 
11.3 of Sayed (2003). 


Single-carrier-frequency-domain equalization. In Prob. II.28 we also show how linear least- 
mean-squares estimation is useful in the context of single-carrier-frequency-domain equalization, 
which involves the addition of a cyclic prefix to the data before transmission in order to simplify the 
structure of the receiver. One of the first works on single-carrier-frequency-domain equalization is 
that of Walzman and Schwartz (1973). More recent works appear in Sari, Karam, and Jeanclaude 
(1995), Clark (1998), Al-Dhahir (2001), and Younis, Sayed, and Al-Dhahir (2003). Cyclic prefixing 
is also useful in the context of orthogonal frequency-division multiplexing (OFDM), where the data 
are first Fourier transformed before the inclusion of a cyclic prefix. OFDM is studied later in
Computer Project VII.1.


Unbiased estimators. In all least-mean-squares estimation problems studied in Chapters 1-7, 
the estimators were required to be unbiased for both cases of deterministic and stochastic unknown 
variables. Sometimes, the condition of unbiasedness can be a hurdle to minimizing the mean-square 
error. This is because, as is well known from the statistical literature, there are estimators that are 
biased but that can achieve smaller error variances than unbiased estimators (see, e.g., Rao (1973), 
Cox and Hinkley (1974), and Kendall and Stuart (1976-1979)). 

Two interesting examples to this effect are the following (the first example is from Kay (1993,
pp. 310-311) while the second example is from Rao (1973)). In Sec. 6.2 we studied the problem
of estimating the mean value, $x$, of $N$ measurements $\{y(i)\}$. The minimum-variance unbiased
estimator for $x$ was seen to be given by the sample mean estimator:

$$\hat{x}_{\rm mvue} = \frac{1}{N}\sum_{i=0}^{N-1} y(i)$$

The value of $x$ was not restricted in any way; it was only assumed to be an unknown constant and
that it could assume any value in the interval $(-\infty, \infty)$ (assuming real-valued data). But what if we
know beforehand that $x$ is limited to some interval, say $[-\alpha, \alpha]$ for some finite $\alpha > 0$? One way to
incorporate this piece of information into the design of an estimator for $x$ is to perhaps consider the
following alternative construction:

$$\check{x} = \begin{cases} -\alpha & \text{if } \hat{x}_{\rm mvue} < -\alpha \\ \hat{x}_{\rm mvue} & \text{if } -\alpha \le \hat{x}_{\rm mvue} \le \alpha \\ \alpha & \text{if } \hat{x}_{\rm mvue} > \alpha \end{cases}$$

in terms of a realization for $\hat{x}_{\rm mvue}$. In this way, $\check{x}$ will always assume values within $[-\alpha, \alpha]$. A
calculation in Kay (1993) shows that although the (truncated mean) estimator $\check{x}$ is biased, it nevertheless
satisfies $E\,(x - \check{x})^2 < E\,(x - \hat{x}_{\rm mvue})^2$. In other words, the truncated mean estimator results in a smaller
mean-square error.
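The effect of truncation can be observed in a small simulation. The sketch below is only illustrative: it assumes Gaussian measurement noise and made-up values for the unknown constant, the interval bound, the noise level, and the number of measurements. It compares the empirical mean-square errors of the sample mean and of its truncated version.

```python
import random

def compare_mean_estimators(x=0.8, alpha=1.0, sigma=2.0, N=4, trials=20000, seed=0):
    """Empirical mean-square errors of the sample mean and of the truncated mean
    when the unknown constant x lies in [-alpha, alpha]. All defaults are assumed values."""
    rng = random.Random(seed)
    se_mean = se_trunc = 0.0
    for _ in range(trials):
        ys = [x + rng.gauss(0.0, sigma) for _ in range(N)]
        m = sum(ys) / N                        # sample mean estimator
        t = max(-alpha, min(alpha, m))         # truncated to [-alpha, alpha]
        se_mean += (x - m) ** 2
        se_trunc += (x - t) ** 2
    return se_mean / trials, se_trunc / trials

mse_mean, mse_trunc = compare_mean_estimators()
print("sample mean:", mse_mean, "truncated mean:", mse_trunc)
```

Because the true value lies inside the interval, clipping can never move a realization of the estimator farther from it, which is why the truncated version cannot do worse in any single trial.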

A second classical example from the realm of statistics is the variance estimator. In this case, the
parameter to be estimated is the variance of a random variable $y$ given access to several observations
of it, say $\{y(i)\}$. Let $\sigma_y^2$ denote the variance of $y$. Two well-known estimators for $\sigma_y^2$ are

$$\hat{\sigma}_y^2 = \frac{1}{N-1}\sum_{i=0}^{N-1} [y(i) - \bar{y}]^2 \quad \text{and} \quad \check{\sigma}_y^2 = \frac{1}{N+1}\sum_{i=0}^{N-1} [y(i) - \bar{y}]^2$$

where $\bar{y} = \frac{1}{N}\sum_{i=0}^{N-1} y(i)$. The first one is unbiased while the second one is biased. However, it is shown in
Rao (1973) that $E\,(\sigma_y^2 - \check{\sigma}_y^2)^2 < E\,(\sigma_y^2 - \hat{\sigma}_y^2)^2$. We therefore see that biased estimators can result in
smaller mean-square errors. However, unbiasedness is often a desirable property in practice since it
guarantees that, on average, the estimator agrees with the unknown quantity that we seek to estimate.
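A simulation can illustrate the trade-off between bias and mean-square error for variance estimators. The sketch below rests on several assumptions that are not taken from the text: the samples are Gaussian, the two estimators are the classical ones that divide the sum of squared deviations from the sample mean by $N-1$ and by $N+1$, and the sample size and number of trials are arbitrary.

```python
import random

def compare_variance_estimators(sigma2=1.0, N=5, trials=40000, seed=1):
    """Empirical mean and mean-square error of two variance estimators
    (normalizations N-1 and N+1) for Gaussian samples; defaults are assumed values."""
    rng = random.Random(seed)
    sigma = sigma2 ** 0.5
    sum_u = se_u = se_b = 0.0
    for _ in range(trials):
        ys = [rng.gauss(0.0, sigma) for _ in range(N)]
        ybar = sum(ys) / N
        ss = sum((y - ybar) ** 2 for y in ys)  # sum of squared deviations
        u = ss / (N - 1)                       # unbiased estimator
        b = ss / (N + 1)                       # biased estimator
        sum_u += u
        se_u += (sigma2 - u) ** 2
        se_b += (sigma2 - b) ** 2
    return sum_u / trials, se_u / trials, se_b / trials

mean_unbiased, mse_unbiased, mse_biased = compare_variance_estimators()
print(mean_unbiased, mse_unbiased, mse_biased)
```

For Gaussian data the theoretical mean-square errors are $2\sigma_y^4/(N-1)$ and $2\sigma_y^4/(N+1)$, so the gap is most visible for small $N$ and shrinks as $N$ grows.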


Channel equalization. The earliest work on equalization methods for digital communications 
applications dates back to the mid 1960s. It was first proposed by Lucky (1965) in the design of a 
linear equalizer. Lucky did not use the mean-square-error criterion, as we did in Secs. 5.4 and 6.4, 
but rather the so-called peak-distortion criterion. Lucky’s solution led to what is known as the zero- 
forcing equalizer whereby the equalizer is essentially the inverse of the channel. This design method 
ignores the presence of noise and, as a result, it can lead to noise amplification and performance 
degradation. Another early work on equalization is that of DiToro (1965). 

Soon afterwards, Widrow (1966), followed by Gersho (1969) and Proakis and Miller (1969), 
proposed using the mean-square-error criterion instead of the peak-distortion criterion. The mean- 
square-error criterion takes the noise into account and generally leads to superior performance. Since 
then, the criterion has been used extensively in the design of optimal equalizer and receiver struc- 
tures, including decision-feedback equalizers and fractionally-spaced equalizers. The first work on 
decision-feedback-equalization was that of Austin (1967), followed by Monsen (1971), George, 
Bowen, and Storey (1971), Price (1972), and Salz (1973). Some of the earliest references on 
fractionally-spaced equalizers were those by Brady (1970), Ungerboeck (1976), Qureshi and For- 
ney (1977), and Gitlin and Weinstein (1981). 

Appendix 3.B of Sayed (2003) motivates the need for equalization methods and explains their role 
in compensating for the time dispersion introduced by communications channels and in combating 
inter-symbol interference. Among other results, the appendix explains the origin of the discrete-time 
channel model (5.10), as well as the concepts of symbol-spaced and fractionally-spaced equalization. 
The material in the appendix complements the discussions in Secs. 5.4 and 6.4 on the design of linear 
and decision-feedback equalizers. Further details on equalization techniques can be found in the 
textbooks by Gitlin, Hayes, and Weinstein (1992) and Proakis (2000). Another accessible reference 
is the survey article by Qureshi (1985), which deals primarily with adaptive equalization. 


Finite-length DFE. A finite-length formulation of the decision-feedback equalizer that is similar 
to the one presented in Sec. 6.4 can be found in Al-Dhahir and Cioffi (1995). A similar design of 
decision-feedback equalizers for MIMO systems (e.g., systems that involve multiple transmit and 
receive antennas) is studied in Probs. II.42-II.45 following the work of Al-Dhahir and Sayed (2000). 


Beamforming. Beamforming has applications in several areas including radar, sonar, and com- 
munications (see, e.g., the book by Johnson and Dudgeon (1993) and the articles by Van Veen and 
Buckley (1988) and Krim and Viberg (1996); the last article discusses several other issues in array 
signal processing and subspace methods). 


Problems and Computer Projects 


PROBLEMS 


Problem II.1 (Rank-one modification of the identity matrix) Consider any matrix of the form
$I + xy^T$, where $x$ and $y$ are column vectors. Use the matrix inversion formula (5.4) to show that its
inverse is also a rank-one modification of the identity. More specifically, show that

$$(I + xy^T)^{-1} = I - \frac{xy^T}{1 + y^T x}$$


Problem II.2 (Determinant of a matrix) Consider a square matrix $A$. A fundamental result in
matrix theory is that every such matrix admits a so-called canonical Jordan decomposition, which
is of the form $A = UJU^{-1}$, where $J = \mathrm{diag}\{J_1, \ldots, J_r\}$ is a block diagonal matrix, say with
$r$ blocks. Each $J_i$ is bi-diagonal with identical diagonal entries $\lambda_i$, and with unit entries along the
lower diagonal, namely,

$$J_i = \begin{bmatrix} \lambda_i & & & \\ 1 & \lambda_i & & \\ & \ddots & \ddots & \\ & & 1 & \lambda_i \end{bmatrix}$$

The size of each $J_i$, say $n_i \times n_i$, indicates the multiplicity of the eigenvalue $\lambda_i$. Show that
$\det A = \prod_{i=1}^{r} \lambda_i^{n_i}$.


Remark. When A is Hermitian, it can be shown that J is necessarily diagonal (rather than block diagonal with 
bi-diagonal blocks). 


Problem II.3 (Trace of a matrix) Use the canonical Jordan factorization of Prob. II.2, and the
fact that $\mathrm{Tr}(XY) = \mathrm{Tr}(YX)$ for any matrices $\{X, Y\}$ of compatible dimensions, to show that the
trace of a matrix, which is the sum of its diagonal elements, is also equal to the sum of its eigenvalues.
What about the determinant of a matrix?


Problem II.4 (Matrix norm) Let $A$ be an $n \times n$ matrix with eigenvalues $\{\lambda_i\}$ and introduce its
spectral radius

$$\rho(A) \triangleq \max_{1 \le i \le n} |\lambda_i|$$

Let further $A = TJT^{-1}$ denote the Jordan canonical form of $A$ (cf. Prob. II.2), and define the $n \times n$
diagonal matrix $D = \mathrm{diag}\{\epsilon, \epsilon^2, \ldots, \epsilon^n\}$, for any given positive scalar $\epsilon$.

(a) Show that $DJD^{-1}$ has the same form as $J$ with the unit entries of $J$ replaced by $\epsilon$. Conclude
that the one norm of $DJD^{-1}$ is equal to $\rho(A) + \epsilon$.
Remark. The one norm of an $n \times n$ matrix $B$ is denoted by $\|B\|_1$ and is defined as

$$\|B\|_1 \triangleq \max_{1 \le j \le n} \sum_{i=1}^{n} |B_{ij}|$$

That is, it is equal to the maximum absolute column sum of $B$.

(b) For any $n \times n$ matrix $B$, define the function $\|B\|_\rho \triangleq \|DT^{-1}BTD^{-1}\|_1$. Show that the
function $\|\cdot\|_\rho$ so defined is a matrix norm (i.e., show that it satisfies the properties of a norm,
as defined in Sec. B.6.)

(c) Verify that $\rho(A) \le \|A\|_\rho \le \rho(A) + \epsilon$.


Problem II.5 (Minimum of a quadratic form) Consider the quadratic cost function $J(x) =
(x - c)^* A (x - c)$, where $x$ and $c$ are column vectors and $A$ is a Hermitian nonnegative-definite
matrix. Argue that the minimum value of $J(x)$ is zero and it is achieved at $x = c + d$ for any $d$
satisfying $Ad = 0$.


Problem II.6 (Trace of error covariance matrix) Use (3.5), and expression (3.17) for the optimal
value of $E\,|\tilde{x}(i)|^2$, to justify (3.18).


Problem II.7 (Determinant of error covariance matrix) For any square matrices $A$ and $B$
satisfying $A \ge B \ge 0$, show that $\det A \ge \det B$. Conclude that the linear least-mean-squares
estimator of Thm. 3.1 also minimizes the determinant of the error covariance matrix, $\det(E\,\tilde{x}\tilde{x}^*)$.

Remark. We therefore see that the linear least-mean-squares estimator minimizes both the trace and the determinant
of the error covariance matrix. This fact is behind the use of the terminologies arithmetic SNR and geometric
SNR as performance measures, which are defined by

$$\mathrm{ASNR} \triangleq \frac{\mathrm{Tr}(R_x)}{\mathrm{Tr}(R_{\tilde{x}})}, \qquad \mathrm{GSNR} \triangleq \left(\frac{\det(R_x)}{\det(R_{\tilde{x}})}\right)^{1/p}$$

where $p$ is the dimension of $x$. The covariance matrix $R_{\tilde{x}}$ denotes the m.m.s.e. of the estimation problem,
$R_{\tilde{x}} = R_x - K_o R_{yx}$.


Problem II.8 (Weighted error cost) Show that the estimator of Thm. 3.1 also minimizes $E\,\tilde{x}^* W \tilde{x}$
for any $W \ge 0$.


Problem II.9 (Independence vs. uncorrelatedness) How would the answer to Ex. 4.1 change 
if the noise sequence {v(i)} were only assumed to be uncorrelated with, rather than independent of, 
the data {s(i)}? 


Problem II.10 (Second-order statistics) How would the answer to Ex. 4.1 change if the symbols
$\{s(i)\}$ were instead chosen uniformly from a QPSK constellation?


Problem II.11 (Nonzero means) Refer again to Ex. 4.1 and assume a generic value for $p$, $0 <
p < 1$. [In the example, we used $p = 1/2$.] In this case, the random variables $\{s(0), s(1)\}$ do
not have zero means anymore, and the linear least-mean-squares estimator of $x = \mathrm{col}\{s(0), s(1)\}$
given $y = \mathrm{col}\{y(0), y(1)\}$ becomes $\hat{x} = \bar{x} + R_{xy} R_y^{-1} (y - \bar{y})$, where

$$\bar{x} = E\,x, \quad \bar{y} = E\,y, \quad R_{xy} = E\,(x - \bar{x})(y - \bar{y})^*, \quad R_y = E\,(y - \bar{y})(y - \bar{y})^*$$

(a) Find $\{\bar{x}, \bar{y}, R_y, R_{xy}\}$.
(b) Determine $\hat{s}(0)$ and $\hat{s}(1)$.
(c) Simplify the results for $p = 1/2$.

Problem II.12 (Perfect estimation) Consider a linear model of the form $d = w^{o*} u + v$, where
$\{w^o, u\}$ are column vectors and $v$ is a scalar. Both $\{u, v\}$ are zero-mean uncorrelated random
variables with $R_u = E\,uu^* > 0$. Moreover, $w^o$ is unknown. Let $k^{o*}$ denote the row vector that defines
the linear least-mean-squares estimator of $d$ given $u$, i.e., $\hat{d} = k^{o*} u$. Show that $k^o = w^o$.

Remark. This result has the following useful interpretation. If an observation $d$ happens to be related linearly to
some data $u$, then the linear least-mean-squares estimator of $d$ given $u$ identifies exactly the unknown vector $w^o$
that relates $d$ to $u$. This result is useful in channel estimation applications, and also in the study of convergence
properties of adaptive schemes, as will be shown later in this book; see, e.g., Secs. 10.5 and 15.2.

Problem II.13 (Correlated component) Assume a zero-mean random variable $x$ consists of
two components, $x = x_c + z$, and that only $x_c$ is correlated with the observation vector $y$. Show
that the linear least-mean-squares estimator of $x$ given $y$ is simply the linear least-mean-squares
estimator of $x_c$ given $y$.

Remark. The result shows that the linear least-mean-squares estimator can only estimate that part of $x$ that is
correlated with the observation. This remark sounds obvious but it forms the basis of some useful applications;
see, e.g., Prob. III.46.


Problem II.14 (Estimation of $x^2$) Let $y = x + v$, where $x$ and $v$ are independent zero-mean
Gaussian real-valued random variables with variances $\sigma_x^2$ and $\sigma_v^2$, respectively. Show that the linear
least-mean-squares estimator of $x^2$ using $\{y, y^2\}$ is

$$\widehat{x^2} = \sigma_x^2 + \frac{\sigma_x^4}{(\sigma_x^2 + \sigma_v^2)^2}\,(y^2 - \sigma_x^2 - \sigma_v^2)$$

Hint: If $s$ is a zero-mean real-valued Gaussian random variable with variance $\sigma_s^2$, then it holds that $E\,s^3 = 0$ and
$E\,s^4 = 3(E\,s^2)^2 = 3\sigma_s^4$.


Problem II.15 (Interfering signals) Refer to Prob. I.14. Find the linear least-mean-square-error
estimator of $x$ given $y$.


Problem II.16 (Power distortion through a channel) A zero-mean independent and identically
distributed sequence $\{s(\cdot)\}$, with variance $\sigma_s^2$, is applied to an FIR channel with impulse
response vector $c$ (i.e., $c$ contains the samples of the channel impulse response). Let $\{z(\cdot)\}$ denote
the channel output sequence. Verify that the variance of $z(\cdot)$ is given by $\sigma_z^2 = \sigma_s^2 \|c\|^2$, in terms
of the squared Euclidean norm of $c$.


Problem II.17 (Linear equalization and delayed estimation) Refer to Ex. 4.1 on linear equalization.
Assume now that we wish to determine the optimal coefficients $\{\alpha(0), \alpha(1), \alpha(2)\}$ of the
linear equalizer in order to estimate $s(i - \Delta)$, for some $\Delta$. That is, the output of the equalizer in
Fig. 4.3 should be changed from $\hat{s}(i)$ to $\hat{s}(i - \Delta)$.

(a) Find $R_{xy}$ for $\Delta = 0, 1, 2, 3$.

(b) Find the optimal equalizer coefficients, and the corresponding minimum mean-square errors,
for $\Delta = 0, 1, 2, 3$.

(c) Would it be useful to use values of $\Delta$ larger than 3? Why?

(d) Which value of $\Delta$ results in the smallest m.m.s.e.? Can you explain why?

(e) The above calculations assume $\sigma_v^2 = 1$. For an arbitrary noise variance $\sigma_v^2$, verify that

$$R_y = \begin{bmatrix} 5/4 + \sigma_v^2 & 1/2 & 0 \\ 1/2 & 5/4 + \sigma_v^2 & 1/2 \\ 0 & 1/2 & 5/4 + \sigma_v^2 \end{bmatrix}$$

Does the value of $R_{xy}$ depend on $\sigma_v^2$? What about the m.m.s.e.? What about the value of $\Delta$
that results in the smallest m.m.s.e.?

(f) The variance of the transmitted data $\{s(i)\}$ is unity. Verify that the variance of the channel
output, $\{z(i)\}$, is $\sigma_z^2 = 5/4$.

(g) Compute the SNR at the output of the equalizer, $\mathrm{SNR} = E\,|s(i - \Delta)|^2 / E\,|\tilde{s}(i - \Delta)|^2$, for the
case $\Delta = 1$ and $\sigma_v^2 = 0.05$.

Problem II.18 (Equalizer performance) Refer again to Ex. 4.1 on linear equalization. The SNR
at the output of the equalizer is defined as $\text{SNR} = E|s(i)|^2 / E|\tilde{s}(i)|^2$. We want to examine the
performance of the equalizer as a function of its number of taps, say $L$. Let $\sigma_v^2$ denote the variance
of the white noise $v(i)$.
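(a) Define the observation vector $y \triangleq \mathrm{col}\{y(i), y(i-1), \ldots, y(i-L+1)\}$, and show that
its $L \times L$ covariance matrix is Toeplitz, namely,

$$R_y = \begin{bmatrix} r_y(0) & r_y(1) & r_y(2) & \cdots \\ r_y(1) & r_y(0) & r_y(1) & \cdots \\ r_y(2) & r_y(1) & r_y(0) & \cdots \\ \vdots & \vdots & \vdots & \ddots \end{bmatrix}_{L \times L}$$

with entries

$$r_y(k) = \begin{cases} 5/4 + \sigma_v^2, & k = 0 \\ 1/2, & k = 1 \\ 0, & 2 \le k \le L-1 \end{cases}$$

We shall denote this covariance matrix by $R_y^{(L)}$, with the superscript $L$ indicating the size of
the vector $y$.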


(b) Let $x = s(i)$. Verify that $R_{xy} = [\,1 \;\; 0 \;\; \ldots \;\; 0\,]$.


(c) Show that the m.m.s.e. pertaining to the problem of estimating $x$ from $y$, and the SNR at the
output of the equalizer, are given by

$$\text{m.m.s.e.} = 1 - \frac{\det R_y^{(L-1)}}{\det R_y^{(L)}}, \qquad \text{SNR} = \frac{1}{1 - \dfrac{\det R_y^{(L-1)}}{\det R_y^{(L)}}}$$
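
The determinant expression for the m.m.s.e. can be compared against the direct formula $1 - R_{xy} R_y^{-1} R_{yx}$. The following is a minimal check under the assumptions of Ex. 4.1 as used above (unit-variance data, $r_y(0) = 5/4 + \sigma_v^2$, $r_y(1) = 1/2$); the helper names are hypothetical, and the determinant of the empty matrix is taken as 1 for $L = 1$.

```python
import numpy as np

def R_y(L, sigma_v2):
    """L x L Toeplitz covariance with r_y(0) = 5/4 + sigma_v2 and r_y(1) = 1/2."""
    r = np.zeros(L)
    r[0] = 1.25 + sigma_v2
    if L > 1:
        r[1] = 0.5
    lags = np.abs(np.subtract.outer(np.arange(L), np.arange(L)))
    return r[lags]

def mmse_direct(L, sigma_v2):
    """1 - R_xy R_y^{-1} R_yx with R_xy = [1 0 ... 0]."""
    e0 = np.zeros(L)
    e0[0] = 1.0
    return 1.0 - float(e0 @ np.linalg.solve(R_y(L, sigma_v2), e0))

def mmse_det(L, sigma_v2):
    """1 - det R_y^{(L-1)} / det R_y^{(L)}."""
    det_prev = 1.0 if L == 1 else np.linalg.det(R_y(L - 1, sigma_v2))
    return 1.0 - det_prev / np.linalg.det(R_y(L, sigma_v2))
```

The two functions agree because the (0,0) entry of $R_y^{-1}$ is a ratio of determinants, and removing the first row and column of a Toeplitz matrix leaves the Toeplitz matrix of one size smaller.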


Problem II.19 (Useful identity) Let $\{x, y\}$ denote two zero-mean random variables with positive-definite
covariance matrices $\{R_x, R_y\}$. Let $\hat{x}$ denote the linear least-mean-squares estimator of $x$
given $y$. Likewise, let $\hat{y}$ denote the linear least-mean-squares estimator of $y$ given $x$. Introduce the
estimation errors $\tilde{x} = x - \hat{x}$ and $\tilde{y} = y - \hat{y}$, and denote their covariance matrices by $R_{\tilde{x}}$ and $R_{\tilde{y}}$,
respectively.

(a) Show that $R_x R_{\tilde{x}}^{-1} \hat{x} = R_{xy} R_{\tilde{y}}^{-1} y$.

(b) Assume $\{y, x\}$ are related via a linear model of the form $y = Hx + v$, where $v$ is zero-mean
with covariance matrix $R_v$ and is uncorrelated with $x$. Verify that the identity of part (a)
reduces to $R_{\tilde{x}}^{-1} \hat{x} = H^* R_v^{-1} y$.
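
Both identities of Prob. II.19 can be spot-checked on a randomly generated real-valued linear model. This is only a sanity check under assumed dimensions, not a proof; `random_pd` and `identity_residuals` are hypothetical names.

```python
import numpy as np

def random_pd(n, rng):
    """Random symmetric positive-definite matrix."""
    A = rng.standard_normal((n, n))
    return A @ A.T + n * np.eye(n)

def identity_residuals(n=3, p=5, seed=0):
    """Largest entrywise mismatch in parts (a) and (b), as gains acting on y."""
    rng = np.random.default_rng(seed)
    R_x, R_v = random_pd(n, rng), random_pd(p, rng)
    H = rng.standard_normal((p, n))
    R_xy = R_x @ H.T
    R_y = H @ R_x @ H.T + R_v
    K_x = R_xy @ np.linalg.inv(R_y)                     # x_hat = K_x y
    R_xt = R_x - K_x @ R_xy.T                           # covariance of x - x_hat
    R_yt = R_y - R_xy.T @ np.linalg.solve(R_x, R_xy)    # covariance of y - y_hat
    res_a = R_x @ np.linalg.solve(R_xt, K_x) - R_xy @ np.linalg.inv(R_yt)
    res_b = np.linalg.solve(R_xt, K_x) - H.T @ np.linalg.inv(R_v)
    return float(np.max(np.abs(res_a))), float(np.max(np.abs(res_b)))
```

Since both sides of each identity are linear in $y$, comparing the matrices that multiply $y$ is enough.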


Problem II.20 (Combining estimators) Let $x$ be a zero-mean random variable with an $M \times M$
positive-definite covariance matrix $R_x$. Let $\hat{x}_1$ denote the linear least-mean-squares estimator of $x$
given a zero-mean observation $y_1$. Likewise, let $\hat{x}_2$ denote the linear least-mean-squares estimator
of the same variable $x$ given a second zero-mean observation $y_2$. That is, we have two separate
estimators for $x$ from two separate sources. Let $P_1$ and $P_2$ denote the corresponding error covariance
matrices, i.e., $P_1 = E\tilde{x}_1\tilde{x}_1^*$, $P_2 = E\tilde{x}_2\tilde{x}_2^*$, where $\tilde{x}_j = x - \hat{x}_j$, and assume $P_1 > 0$ and $P_2 > 0$.
Assume further that the cross-covariance matrix 


u BER ET) 


(a) Show that the linear least-mean-squares estimator of $x$ given both observations $\{y_1, y_2\}$,
denoted by $\hat{x}$, satisfies $P^{-1}\hat{x} = P_1^{-1}\hat{x}_1 + P_2^{-1}\hat{x}_2$, where $P$ denotes the resulting error
covariance matrix and is given by $P^{-1} = P_1^{-1} + P_2^{-1} - R_x^{-1}$.

Remark. This useful result tells us that the estimators $\{\hat{x}_1, \hat{x}_2\}$ and their error covariances $\{P_1, P_2\}$ can
be fused together without the need to access the original measurements $\{y_1, y_2\}$.

(b) Assume $\{y_1, x\}$ and $\{y_2, x\}$ are related via linear models of the form $y_1 = H_1 x + v_1$ and
$y_2 = H_2 x + v_2$, where $\{v_1, v_2\}$ are zero-mean with covariance matrices $\{R_{v_1}, R_{v_2}\}$ and
are uncorrelated with each other and with $x$. Verify that this situation satisfies the required
rank-deficiency condition and conclude that the estimator of $x$ given $\{y_1, y_2\}$ is given by the
expression in part (a).
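
For the linear models of part (b), the fusion rule of part (a) can be compared with the estimator computed directly from the stacked observation. The sketch below assumes real-valued data and arbitrary illustrative dimensions; `fuse_demo` and its helpers are hypothetical names.

```python
import numpy as np

def random_pd(n, rng):
    """Random symmetric positive-definite matrix."""
    A = rng.standard_normal((n, n))
    return A @ A.T + n * np.eye(n)

def fuse_demo(M=3, seed=1):
    rng = np.random.default_rng(seed)
    R_x = random_pd(M, rng)
    H1, H2 = rng.standard_normal((4, M)), rng.standard_normal((5, M))
    Rv1, Rv2 = random_pd(4, rng), random_pd(5, rng)

    def llmse(H, Rv):
        # Gain K = R_xy R_y^{-1} and error covariance P = R_x - K R_yx
        R_y = H @ R_x @ H.T + Rv
        K = R_x @ H.T @ np.linalg.inv(R_y)
        return K, R_x - K @ H @ R_x

    K1, P1 = llmse(H1, Rv1)
    K2, P2 = llmse(H2, Rv2)
    # Joint estimator from the stacked observation col{y1, y2}
    H = np.vstack([H1, H2])
    Rv = np.block([[Rv1, np.zeros((4, 5))], [np.zeros((5, 4)), Rv2]])
    K, P = llmse(H, Rv)

    y1, y2 = rng.standard_normal(4), rng.standard_normal(5)
    x_joint = K @ np.concatenate([y1, y2])
    # Fusion using only {x_hat_1, x_hat_2, P1, P2, R_x}
    P_inv = np.linalg.inv(P1) + np.linalg.inv(P2) - np.linalg.inv(R_x)
    rhs = np.linalg.solve(P1, K1 @ y1) + np.linalg.solve(P2, K2 @ y2)
    x_fused = np.linalg.solve(P_inv, rhs)
    return x_joint, x_fused, P, np.linalg.inv(P_inv)
```

The joint estimator is deliberately computed in covariance form rather than information form, so that the comparison is not circular.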


Problem II.21 (Linear and optimal estimators) A random variable $z$ is defined as follows:

$$z = \begin{cases} -x & \text{with probability } p \\ Hx + v & \text{with probability } 1 - p \end{cases}$$

where $x$ and $v$ are zero-mean uncorrelated random variables. Assume we know the linear least-mean-squares
estimator of $x$ given $y$, namely, $\hat{x}_{|y}$, where $y$ is a zero-mean random variable
that is also uncorrelated with $v$.

(a) Find an expression for $\hat{z}_{|y}$ in terms of $\hat{x}_{|y}$.

(b) More generally, what can you say about the relation between $E(z|y)$ and $E(x|y)$?

(c) Find an expression for the linear least-mean-squares estimator $\hat{z}_{|x}$ and the corresponding
m.m.s.e.

(d) How would your answers to parts (a) and (b) change if the random variables $\{v, y\}$ were
jointly circular and Gaussian?


Problem II.22 (Distributed processing) Consider a distributed network with $m$ nodes as shown
in Fig. II.1. Each node $k$ observes a zero-mean measurement $y_k$ that is related to an unknown zero-mean
variable $x$ via a linear model of the form $y_k = H_k x + v_k$, where the data matrix $H_k$ is known,
and the noise $v_k$ is zero mean and uncorrelated with $x$. The noises across all nodes are uncorrelated
with each other. Let $\{R_x, R_{v_k}\}$ denote the positive-definite covariance matrices of $\{x, v_k\}$, respectively.
Introduce the following notation:

• At each node $k$, the notation $\hat{x}_k$ denotes the linear least-mean-squares estimator of $x$ that
is based on the observation $y_k$. Likewise, $P_k$ denotes the resulting error covariance matrix,
$P_k = E\tilde{x}_k\tilde{x}_k^*$.

• At each node $k$, the notation $\hat{x}_{1:k}$ denotes the linear least-mean-squares estimator of $x$ that
is based on the observations $\{y_1, y_2, \ldots, y_k\}$, i.e., on the observations collected at nodes 1
through $k$. Likewise, $P_{1:k}$ denotes the resulting error covariance matrix, $P_{1:k} = E\tilde{x}_{1:k}\tilde{x}_{1:k}^*$.


FIGURE II.1 A distributed estimation network. 


The network functions as follows. Node 1 uses $y_1$ to estimate $x$. The resulting estimator, $\hat{x}_1$, and
the corresponding error covariance matrix, $P_1 = E\tilde{x}_1\tilde{x}_1^*$, are transmitted to node 2. Node 2 in turn
uses its measurement $y_2$ and the data $\{\hat{x}_1, P_1\}$ received from node 1 to compute the estimator of
$x$ that is based on both observations $\{y_1, y_2\}$. Note that node 2 does not have access to $y_1$ but
only to $y_2$ and the information received from node 1. The estimator computed by node 2, $\hat{x}_{1:2}$, and
the corresponding error covariance matrix, $P_{1:2}$, are then transmitted to node 3. Node 3 evaluates
$\{\hat{x}_{1:3}, P_{1:3}\}$ using $\{y_3, \hat{x}_{1:2}, P_{1:2}\}$ and transmits $\{\hat{x}_{1:3}, P_{1:3}\}$ to node 4 and so forth.
(a) Find an expression for $\hat{x}_{1:m}$ in terms of $\hat{x}_{1:m-1}$ and $\hat{x}_m$.

(b) Find an expression for $P_{1:m}^{-1}$ in terms of $\{P_{1:m-1}^{-1}, P_m^{-1}, R_x^{-1}\}$.

(c) Find a recursion relating $P_{1:m}$ to $P_{1:m-1}$.

(d) Show that $P_{1:m}$ is a non-increasing sequence as a function of $m$.

(e) Assume $H_k = H$ for all $k$ and $R_{v_k} = R_v > 0$. Assume further that $H$ is tall and has full
column rank. Find $\lim_{m \to \infty} P_{1:m}$.


Problem II.23 (Random telegraph signal) The probability distribution function of a Poisson
process with rate $\lambda$ is defined by

$$P(N = k) = \frac{(\lambda t)^k e^{-\lambda t}}{k!}, \qquad k = 0, 1, 2, \ldots$$

where $\lambda$ is the average number of events per second and $N$ is the number of events that can occur
in the interval of time $[0, t]$. Now consider a random process $x(t)$ that is generated as follows. Its
initial value $x(0)$ is $\pm 1$ with probability 1/2, and it changes polarity thereafter with each occurrence
of an event in a Poisson process.

(a) Show that the probability that $x(t) = +1$ given that $x(0) = +1$ is given by the expression
$P[x(t) = +1 \,|\, x(0) = +1] = (1 + e^{-2\lambda t})/2$. Repeat for $P[x(t) = -1 \,|\, x(0) = -1]$.

(b) Show also that $P[x(t) = -1 \,|\, x(0) = +1] = (1 - e^{-2\lambda t})/2$. Repeat for
$P[x(t) = +1 \,|\, x(0) = -1]$.

(c) Show that, for any $t$, $P[x(t) = +1] = P[x(t) = -1] = 1/2$.

(d) Show that $x(t)$ is a zero-mean stationary random process with auto-correlation function
$R_x(\tau) = e^{-2\lambda|\tau|}$.

(e) Show that the linear least-mean-squares estimator of $x(0)$ given both $x(T)$ and $x(2T)$ is

$$\hat{x}(0) = \frac{e^{-2\lambda T} - e^{-6\lambda T}}{1 - e^{-4\lambda T}}\; x(T)$$
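
Part (e) can be checked numerically from the auto-correlation function of part (d) alone. In the sketch below, the observation vector is assumed to be $\mathrm{col}\{x(T), x(2T)\}$, and the function names are hypothetical.

```python
import numpy as np

def telegraph_gains(lam, T):
    """Coefficients of x(T) and x(2T) in the l.l.m.s. estimator of x(0)."""
    a = np.exp(-2.0 * lam * T)           # R_x(T); note R_x(2T) = a**2
    R_y = np.array([[1.0, a], [a, 1.0]])
    R_xy = np.array([a, a ** 2])
    return np.linalg.solve(R_y, R_xy)    # R_y is symmetric

def closed_form_gain(lam, T):
    """Coefficient of x(T) in the expression of part (e)."""
    return (np.exp(-2 * lam * T) - np.exp(-6 * lam * T)) / (1 - np.exp(-4 * lam * T))
```

The coefficient of $x(2T)$ comes out as zero, which is why only $x(T)$ appears in the estimator: given $x(T)$, the older sample carries no additional second-order information about $x(0)$.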


Problem II.24 (Space-time coding) Consider a two-transmit one-receive antenna system, as
shown in Fig. II.2. Let $\alpha$ denote the channel gain from transmit antenna one to the receiver and
let $\beta$ denote the channel gain from transmit antenna two to the same receiver. The channels
between the two transmitters and the receiver simply scale the transmitted data by the scalar gains
$\{\alpha, \beta\}$; we say that they are flat channels since their frequency responses will be flat. Both gains
$\{\alpha, \beta\}$ are assumed known. In space-time coding, symbols $\{s_1(i), s_2(i)\}$ are transmitted at time $i$
from the two antennas followed by the symbols $\{-s_2^*(i), s_1^*(i)\}$ at time $i + 1$. The corresponding
received signals will therefore be

$$r(i) = \alpha s_1(i) + \beta s_2(i) + v(i), \qquad r(i+1) = -\alpha s_2^*(i) + \beta s_1^*(i) + v(i+1)$$

The noise sequence $\{v(\cdot)\}$ is white with variance $\sigma_v^2$ and uncorrelated with the data $\{s_1(\cdot), s_2(\cdot)\}$.
The symbols $\{s_1(\cdot), s_2(\cdot)\}$ are assumed independent and of variance $\sigma_s^2$ each.


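(a) Verify that

$$\begin{bmatrix} r(i) \\ r^*(i+1) \end{bmatrix} = \underbrace{\begin{bmatrix} \alpha & \beta \\ \beta^* & -\alpha^* \end{bmatrix}}_{H} \begin{bmatrix} s_1(i) \\ s_2(i) \end{bmatrix} + \begin{bmatrix} v(i) \\ v^*(i+1) \end{bmatrix}$$

and that the matrix $H$ satisfies $H^* H = (|\alpha|^2 + |\beta|^2) I$.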


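(b) Show that the linear least-mean-squares estimators of $\{s_1(i), s_2(i)\}$ given both measurements
$\{r(i), r(i+1)\}$ are given by

$$\begin{bmatrix} \hat{s}_1(i) \\ \hat{s}_2(i) \end{bmatrix} = \frac{\sigma_s^2}{\sigma_s^2 (|\alpha|^2 + |\beta|^2) + \sigma_v^2} \begin{bmatrix} \alpha^* & \beta \\ \beta^* & -\alpha \end{bmatrix} \begin{bmatrix} r(i) \\ r^*(i+1) \end{bmatrix}$$

FIGURE II.2 A two-transmit one-receive antenna system.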


Remark. The idea of coding the data in the manner described in this problem in order to exploit transmit diversity, 
and subsequently simplify the structure of the receiver, is due to Alamouti (1998). 
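
The orthogonality of the space-time code matrix, and the way it collapses the estimator to a scaled matched filter, can be illustrated numerically. The sketch assumes complex gains and the stacked model $\mathrm{col}\{r(i), r^*(i+1)\}$ of part (a); all function names are hypothetical.

```python
import numpy as np

def alamouti_matrix(alpha, beta):
    """Channel matrix H of part (a) acting on col{s1(i), s2(i)}."""
    return np.array([[alpha, beta],
                     [np.conj(beta), -np.conj(alpha)]])

def llmse_gain_general(H, sigma_s2, sigma_v2):
    """R_s H^* (H R_s H^* + R_v)^{-1} with R_s = sigma_s2 I and R_v = sigma_v2 I."""
    n = H.shape[0]
    R_y = sigma_s2 * H @ H.conj().T + sigma_v2 * np.eye(n)
    return sigma_s2 * H.conj().T @ np.linalg.inv(R_y)

def llmse_gain_closed_form(alpha, beta, sigma_s2, sigma_v2):
    """Scaled H^*, as in part (b)."""
    g = abs(alpha) ** 2 + abs(beta) ** 2
    return sigma_s2 / (sigma_s2 * g + sigma_v2) * alamouti_matrix(alpha, beta).conj().T
```

Because $H^*H$ is a multiple of the identity, the receiver needs no matrix inversion, which is the structural simplification mentioned in the remark.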


Problem II.25 (Maximal ratio combining) Consider a one-transmit two-receive antenna system,
as shown in Fig. II.3. Let $\alpha$ denote the channel gain from the transmit antenna to the first
receiver and let $\beta$ denote the channel gain from the same transmit antenna to the second receiver.
Both channel gains are possibly complex-valued. A symbol $s$ is transmitted and the received signals
by both antennas are $y = \alpha s + v$ and $z = \beta s + w$, where $\{v, w\}$ denote zero-mean noise
components that are uncorrelated with each other and with $s$ and have the same variance $\sigma_v^2$. On the
receiver side, the scalar signals $\{y, z\}$ are to be combined linearly, say as $\hat{s} = ay + bz$, in order to
generate an enhanced signal $\hat{s}$ with maximal signal-to-noise ratio. We wish to determine the values
of $\{a, b\}$.
FIGURE II.3 A one-transmit two-receive antenna system.


(a) Collect the measurements into vector form,

$$\begin{bmatrix} y \\ z \end{bmatrix} = \underbrace{\begin{bmatrix} \alpha \\ \beta \end{bmatrix}}_{h} s + \begin{bmatrix} v \\ w \end{bmatrix}$$

and let $r^* = [\,a \;\; b\,]$ denote the desired combination vector. Argue that the SNR after processing
by $r$ is $\text{SNR} = \sigma_s^2 |r^* h|^2 / (\sigma_v^2 \|r\|^2)$, where $r^* h$ is the inner product between $r$ and $h$.

(b) Use the Cauchy-Schwarz inequality to conclude that a choice for $r$ that maximizes the SNR
is $r = h$ and that the resulting maximum SNR is $\text{SNR}_{\max} = \sigma_s^2 (|\alpha|^2 + |\beta|^2)/\sigma_v^2$. Conclude
that an optimal linear combination is $\hat{s} = \alpha^* y + \beta^* z$.


Remark. Maximal ratio combining is a classical technique due to Brennan (1959). As can be seen from the result 
of this problem, MRC is an optimal spatial diversity receiver that maximizes the output SNR and thereby reduces 
signal distortions caused by multipath propagation. This technique is employed in RAKE receivers, as discussed 
in Computer Project 11.3 of Sayed (2003). 
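
The Cauchy-Schwarz argument of Prob. II.25 can be illustrated by evaluating the SNR expression of part (a) for the matched choice $r = h$ and for other combiners. The function name `output_snr` and the numerical values used with it are illustrative assumptions.

```python
import numpy as np

def output_snr(r, h, sigma_s2, sigma_v2):
    """SNR = sigma_s2 |r^* h|^2 / (sigma_v2 ||r||^2) for the combiner r^* [y; z]."""
    r = np.asarray(r, dtype=complex)
    h = np.asarray(h, dtype=complex)
    # np.vdot conjugates its first argument, so np.vdot(r, h) equals r^* h
    return float(sigma_s2 * abs(np.vdot(r, h)) ** 2 / (sigma_v2 * np.vdot(r, r).real))
```

Evaluating `output_snr(h, h, ...)` returns $\sigma_s^2\|h\|^2/\sigma_v^2$, and any other direction for `r` gives a value that is no larger; scaling `r` does not change the result.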


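Problem II.26 (MMSE linear combining) Consider the same setting as Prob. II.25, except that
now it is desired to determine the coefficients $\{a, b\}$ by minimizing the mean-square error in
estimating $s$ from $\{y, z\}$,

$$\min_{a,b} \; E \left| s - \begin{bmatrix} a & b \end{bmatrix} \begin{bmatrix} y \\ z \end{bmatrix} \right|^2$$

Use the result of Thm. 5.1 to show that the linear least-mean-squares estimator of $s$ given $\{y, z\}$ is
given by

$$\hat{s} = \frac{1}{\sigma_v^2/\sigma_s^2 + |\alpha|^2 + |\beta|^2} \, (\alpha^* y + \beta^* z) \quad \text{with m.m.s.e.} = \frac{\sigma_v^2}{\sigma_v^2/\sigma_s^2 + |\alpha|^2 + |\beta|^2}$$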

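Problem II.27 (Cyclic prefixing) Assume 9 data points, $\{s(0), s(1), s(2), \ldots, s(8)\}$, are transmitted
through a channel with transfer function $H(z) = h(0) + h(1)z^{-1} + h(2)z^{-2}$, and collect the
resulting channel outputs $\{z(2), z(3), \ldots, z(8)\}$.

(a) Verify that the transmissions and measurements are related via

$$\underbrace{\begin{bmatrix} z(8) \\ z(7) \\ z(6) \\ z(5) \\ z(4) \\ z(3) \\ z(2) \end{bmatrix}}_{z:\,7\times 1} = \underbrace{\begin{bmatrix} h(0) & h(1) & h(2) & & & & & & \\ & h(0) & h(1) & h(2) & & & & & \\ & & h(0) & h(1) & h(2) & & & & \\ & & & h(0) & h(1) & h(2) & & & \\ & & & & h(0) & h(1) & h(2) & & \\ & & & & & h(0) & h(1) & h(2) & \\ & & & & & & h(0) & h(1) & h(2) \end{bmatrix}}_{H:\,7\times 9} \underbrace{\begin{bmatrix} s(8) \\ s(7) \\ s(6) \\ s(5) \\ s(4) \\ s(3) \\ s(2) \\ s(1) \\ s(0) \end{bmatrix}}_{s:\,9\times 1}$$

Remark. The channel matrix $H$ defined above has a Toeplitz structure, i.e., it has constant elements along
its diagonals.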
(b) Now assume that the transmitted data is designed such that $s(8) = s(1)$ and $s(7) = s(0)$.
That is, after 7 transmissions $\{s(0), \ldots, s(6)\}$, we repeat the first 2 symbols $\{s(0), s(1)\}$.
Show that the transmissions and measurements are now related as

$$\underbrace{\begin{bmatrix} z(8) \\ z(7) \\ z(6) \\ z(5) \\ z(4) \\ z(3) \\ z(2) \end{bmatrix}}_{z:\,7\times 1} = \underbrace{\begin{bmatrix} h(0) & h(1) & h(2) & & & & \\ & h(0) & h(1) & h(2) & & & \\ & & h(0) & h(1) & h(2) & & \\ & & & h(0) & h(1) & h(2) & \\ & & & & h(0) & h(1) & h(2) \\ h(2) & & & & & h(0) & h(1) \\ h(1) & h(2) & & & & & h(0) \end{bmatrix}}_{H:\,7\times 7} \underbrace{\begin{bmatrix} s(8) \\ s(7) \\ s(6) \\ s(5) \\ s(4) \\ s(3) \\ s(2) \end{bmatrix}}_{s:\,7\times 1}$$

Remark. Observe that the channel matrix $H$ is now square. It still has a Toeplitz structure. However,
in addition, it also has a circulant structure. This means that each of its rows can be obtained from the
previous one by circularly shifting it to the right by one entry.


(c) More generally, consider a channel with memory $M$, i.e., $H(z) = h(0) + h(1)z^{-1} + \ldots +
h(M)z^{-M}$, and assume that we collect $N + 1$ measurements starting from some time instant
$i$, i.e., $z$ is now $(N+1) \times 1$ with data $z = \mathrm{col}\{z(i+N), z(i+N-1), \ldots, z(i+1), z(i)\}$.
Assume further that the transmissions are designed with a cyclic prefix, namely, the last $M$
values of $s(\cdot)$ coincide with the $M$ values of $s(\cdot)$ prior to time $i$,

$$s(i+N-k) = s(i-1-k), \qquad k = 0, 1, \ldots, M-1$$

Let $s = \mathrm{col}\{s(i+N), s(i+N-1), \ldots, s(i+1), s(i)\}$ and assume $N > M$. Verify that
$z$ and $s$ are related via an $(N+1) \times (N+1)$ circulant channel matrix $H$.
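
The effect of the cyclic prefix in part (c) can be verified by comparing ordinary linear convolution of the prefixed sequence with multiplication by a circulant matrix. The sketch assumes real data, a channel memory $M \ge 1$, and the latest-first ordering used in the problem; the function names are hypothetical.

```python
import numpy as np

def circulant_channel(h, n):
    """n x n circulant matrix whose first row is [h(0), ..., h(M), 0, ..., 0]."""
    H = np.zeros((n, n))
    for r in range(n):
        for k, hk in enumerate(h):
            H[r, (r + k) % n] = hk      # each row is the previous one shifted right
    return H

def cyclic_prefix_check(h, block):
    """Return (H s, z) for a block s(i..i+N) sent after its cyclic prefix."""
    h = np.asarray(h, dtype=float)
    block = np.asarray(block, dtype=float)      # s(i), ..., s(i+N) in time order
    M = len(h) - 1                              # assumed M >= 1
    tx = np.concatenate([block[-M:], block])    # prefix = copy of the last M symbols
    z_full = np.convolve(tx, h)                 # ordinary (linear) convolution
    z_block = z_full[M:M + len(block)]          # z(i), ..., z(i+N)
    H = circulant_channel(h, len(block))
    # Latest sample first, as in z = col{z(i+N), ..., z(i)}
    return H @ block[::-1], z_block[::-1]
```

With `h` of length 3 and a block of 7 symbols, this reproduces the $7 \times 7$ setting of part (b).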


Problem II.28 (Frequency-domain equalization) Independent and identically distributed data
$\{s(i)\}$, with variance $\sigma_s^2$, are transmitted through a known channel with transfer function

$$H(z) = h(0) + h(1)z^{-1} + h(2)z^{-2} + \ldots + h(M)z^{-M}$$

A cyclic prefix is incorporated into the data, as explained in part (c) of Prob. II.27. The output of
the channel is observed under additive white noise, $v(i)$, with variance $\sigma_v^2$, i.e., $y(i) = z(i) + v(i)$.
Collect a block of $(N+1)$ measurements, starting from some time instant $i$,
$y = \mathrm{col}\{y(i+N), y(i+N-1), \ldots, y(i+1), y(i)\}$, and define the corresponding data vector
$s = \mathrm{col}\{s(i+N), s(i+N-1), \ldots, s(i+1), s(i)\}$. We know from Prob. II.27 that $y$ and $s$ are related via $y = Hs + v$,
where $H$ is $(N+1) \times (N+1)$ circulant, and $v = \mathrm{col}\{v(i+N), \ldots, v(i+1), v(i)\}$ denotes the
noise vector.


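(a) Every circulant matrix $H$ can be diagonalized by the discrete Fourier transform (DFT), i.e.,
it holds that $\Lambda = F H F^*$, where $F$ is the (unitary) DFT matrix of size $(N+1)$:

$$[F]_{km} = \frac{1}{\sqrt{N+1}} \, e^{-j 2\pi k m/(N+1)}, \qquad k, m = 0, 1, \ldots, N$$

and $\Lambda$ is diagonal with eigenvalues denoted by $\{\lambda_k\}$; see, e.g., Prob. VI.6. Define the
frequency-transformed vectors $\bar{y} = Fy$, $\bar{s} = Fs$, and $\bar{v} = Fv$. That is, $\{\bar{y}, \bar{s}, \bar{v}\}$ are the
DFTs of the vectors $\{y, s, v\}$. Show that $\bar{y} = \Lambda \bar{s} + \bar{v}$ and that the covariance matrices of
$\{\bar{v}, \bar{s}\}$ are $\{\sigma_v^2 I, \sigma_s^2 I\}$.

(b) Show that the linear least-mean-squares estimator of $\bar{s}$ is $\hat{\bar{s}} = \sigma_s^2 \Lambda^* (\sigma_v^2 I + \sigma_s^2 \Lambda \Lambda^*)^{-1} \bar{y}$.
Conclude that the entries of $\{\bar{y}, \hat{\bar{s}}\}$ are related to each other via a simple scaling operation,
namely, for the $k$-th entries:

$$\hat{\bar{s}}(k) = \frac{\sigma_s^2 \lambda_k^*}{\sigma_v^2 + \sigma_s^2 |\lambda_k|^2} \; \bar{y}(k)$$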

Remark. The result of this problem explains the concept behind single-carrier-frequency-domain equalization (for
more details see, e.g., Sari, Karam, and Jeanclaude (1995), Clark (1998), Al-Dhahir (2001), and Younis, Sayed,
and Al-Dhahir (2003)). The data $\{s(i)\}$ are transmitted with cyclic prefixing, and a block of measurements $y$
is collected. The DFT of $y$ is computed and its entries scaled according to the expression in part (b) above,
by using the diagonal elements $\{\lambda_k\}$ of the frequency-transformed channel matrix $H$. The estimators $\hat{\bar{s}}$ are
then transformed back to the time domain by undoing the DFT operation, i.e., $\hat{s} = F^* \hat{\bar{s}}$. The use of a cyclic
prefix adds redundancy to the transmitted data and results in overhead costs. Therefore, there is a compromise
to be balanced between the design of a simple receiver structure and the loss in information capacity. Cyclic
prefixing is also useful in the context of orthogonal frequency-division multiplexing (OFDM), where the receiver
structure can be derived in much the same manner as in this problem. We leave the details of OFDM to Computer
Project VII.1, where it will be shown how least-mean-squares (as well as least-squares) techniques are useful in
the design of OFDM receivers.
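
The receiver described in the remark can be sketched in a few lines and compared with the time-domain estimator. The block below assumes a real channel, a unitary DFT matrix built from `np.fft.fft`, and illustrative parameter values; `circulant_channel` and `fde_demo` are hypothetical names, and the measurement block is an arbitrary test vector rather than simulated channel data.

```python
import numpy as np

def circulant_channel(h, n):
    """n x n circulant matrix whose first row is [h(0), ..., h(M), 0, ..., 0]."""
    H = np.zeros((n, n))
    for r in range(n):
        for k, hk in enumerate(h):
            H[r, (r + k) % n] = hk
    return H

def fde_demo(h=(1.0, 0.5, 0.25), n=8, sigma_s2=1.0, sigma_v2=0.1, seed=0):
    rng = np.random.default_rng(seed)
    H = circulant_channel(h, n)
    F = np.fft.fft(np.eye(n)) / np.sqrt(n)          # unitary DFT matrix
    Lam = F @ H @ F.conj().T                        # should be diagonal
    lam = np.diag(Lam)
    y = rng.standard_normal(n)                      # arbitrary measurement block
    # Time-domain estimator: sigma_s2 H^* (sigma_v2 I + sigma_s2 H H^*)^{-1} y
    s_time = sigma_s2 * H.T @ np.linalg.solve(
        sigma_v2 * np.eye(n) + sigma_s2 * H @ H.T, y)
    # Frequency-domain estimator: one scalar scaling per DFT bin, then inverse DFT
    y_bar = F @ y
    s_bar = sigma_s2 * lam.conj() / (sigma_v2 + sigma_s2 * np.abs(lam) ** 2) * y_bar
    s_freq = F.conj().T @ s_bar
    off_diag = float(np.max(np.abs(Lam - np.diag(lam))))
    return off_diag, s_time, s_freq
```

The per-bin scaling replaces an $(N+1) \times (N+1)$ matrix inversion by $N+1$ scalar divisions, which is the receiver simplification that the cyclic prefix buys.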


Problem II.29 (Multi-transmit multi-receive antenna system) Independent and identically
distributed data $\{s_1(\cdot), s_2(\cdot)\}$ are transmitted from two antennas and received by two antennas, after
travelling through a multi-channel environment and being corrupted by additive white noises. Transmissions
from antenna 1 travel through channels $\{h_{11}, h_{12}\}$, which refer to the impulse response
sequences of the channels between transmit antenna 1 and receive antennas 1 and 2. Likewise, transmissions
from antenna 2 travel through channels $\{h_{22}, h_{21}\}$, which refer to the impulse response
sequences of the channels between transmit antenna 2 and receive antennas 1 and 2. This scenario
corresponds to transmissions through a MIMO (multi-input multi-output) channel; see Fig. II.4.
The signals $\{s_1(\cdot), s_2(\cdot), v_1(\cdot), v_2(\cdot)\}$ are assumed uncorrelated with each other, with the signals
having variances $\{\sigma_{s_1}^2, \sigma_{s_2}^2\}$ and the noises having variances $\{\sigma_{v_1}^2, \sigma_{v_2}^2\}$.


FIGURE II.4 A two-transmit two-receive antenna system.


The transmissions from both users are not only distorted by the channels but they also interfere
with each other. We wish to design a linear equalizer (receiver) in order to recover the transmitted
data $\{s_1(\cdot), s_2(\cdot)\}$ from the corrupted measurements $\{y_1(\cdot), y_2(\cdot)\}$. Assume in this problem that
the transfer functions of the channels are


hu(z) -1-az ,  ha(z)-1-fz ,  ha(z)-az ^,  han(z)= bz? 


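in terms of $\{\alpha, \beta, a, b\}$. Define the vectors of measurements and data at time $i$,

$$y_i \triangleq \begin{bmatrix} y_1(i) \\ y_2(i) \end{bmatrix}, \qquad s_i \triangleq \begin{bmatrix} s_1(i) \\ s_2(i) \end{bmatrix}$$

and choose the receiver to be of length 3, i.e., it generates an estimate for $s_i$ as follows:

$$\hat{s}_i = W_0 y_i + W_1 y_{i-1} + W_2 y_{i-2}$$

where the $\{W_0, W_1, W_2\}$ are $2 \times 2$ coefficient matrices to be determined optimally in the
least-mean-squares sense.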


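(a) Verify that

$$\underbrace{\begin{bmatrix} y_i \\ y_{i-1} \\ y_{i-2} \end{bmatrix}}_{y} = \underbrace{\begin{bmatrix} I_2 & P & & \\ & I_2 & P & \\ & & I_2 & P \end{bmatrix}}_{H} \underbrace{\begin{bmatrix} s_i \\ s_{i-1} \\ s_{i-2} \\ s_{i-3} \end{bmatrix}}_{s} + \underbrace{\begin{bmatrix} v_i \\ v_{i-1} \\ v_{i-2} \end{bmatrix}}_{v}$$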

where P is the 2 x 2 matrix 


aja b 
ii 


Observe that the channel matrix H has a block Toeplitz structure with 2 x 2 block entries. 


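(b) Let $\{R_y, R_s, R_v\}$ denote the covariances of $\{y, s, v\}$. Show that

$$R_v = \mathrm{diag}\{D_v, D_v, D_v\}, \qquad R_s = \mathrm{diag}\{D_s, D_s, D_s, D_s\}$$

where

$$D_v = \begin{bmatrix} \sigma_{v_1}^2 & \\ & \sigma_{v_2}^2 \end{bmatrix}, \qquad D_s = \begin{bmatrix} \sigma_{s_1}^2 & \\ & \sigma_{s_2}^2 \end{bmatrix}$$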
(c) Determine the receiver coefficients $\{W_0, W_1, W_2\}$ and the resulting m.m.s.e. in terms of
$\{H, D_v, D_s\}$.
(d) Draw a block diagram representation for the receiver when {a = 1/2, 8 = —1/2,a = 1,b = 


1.03 zx ao = 1,02 = 022 = 1/2}. 
(e) Determine the receiver coefficients in the following situations:

(e.1) $a = 0$, $b = 0$, and all other values as in part (d).

(e.2) $a = 0$ and all other values as in part (d).

(e.3) $b = 0$ and all other values as in part (d).

(e.4) $h_{21} = 0$, $h_{22} = 0$ and all other values as in part (d).


Problem II.30 (MIMO channel estimation) Consider a scenario similar to Prob. II.29, except
that now known symbols $\{s_1(\cdot), s_2(\cdot)\}$ are transmitted from two antennas and received by two antennas,
after travelling through a multi-channel environment and being corrupted by uncorrelated
additive white noises with variances $\{\sigma_{v_1}^2, \sigma_{v_2}^2\}$. Transmissions from antenna 1 travel through channels
$\{h_{11}, h_{12}\}$, while transmissions from antenna 2 travel through channels $\{h_{22}, h_{21}\}$. The impulse
response sequences are now modeled as zero-mean random variables, $\{h_{11}, h_{12}, h_{21}, h_{22}\}$,
and they are assumed to be of first-order for simplicity (i.e., each has two taps at most). Let
$h = \mathrm{col}\{h_{11}, h_{12}, h_{21}, h_{22}\}$ denote a column vector with the entries of the individual channels
stacked on top of each other; its covariance matrix is taken as $R_h = \sigma_h^2 I$ (assumed known).

Define again

$$y_i \triangleq \begin{bmatrix} y_1(i) \\ y_2(i) \end{bmatrix}, \qquad s_i \triangleq \begin{bmatrix} s_1(i) \\ s_2(i) \end{bmatrix}$$

where $y_i$ is random while $s_i$ is deterministic, and assume $N + 1$ measurement vectors $y_i$ are collected,
say $y = \mathrm{col}\{y_0, y_1, \ldots, y_N\}$.

(a) Verify that $y$ is related to $h$ via $y = Sh + v$ for some $\{S, v\}$ to be determined.

(b) Determine an expression for the linear least-mean-squares estimator of $h$ given $y$. Find also
the resulting m.m.s.e.


Problem II.31 (Lattice recursions) Consider a zero-mean scalar random process $\{y(i)\}$. For
each time instant $i$, let

$\hat{y}(i \,|\, i-1 : i-m)$ = l.l.m.s. estimator of $y(i)$ given the $m$ past observations
$\{y(i-1), y(i-2), \ldots, y(i-m)\}$

$\hat{y}(i-m-1 \,|\, i-1 : i-m)$ = l.l.m.s. estimator of $y(i-m-1)$ given the same observations
$\{y(i-1), y(i-2), \ldots, y(i-m)\}$

We refer to $\hat{y}(i \,|\, i-1 : i-m)$ as the forward prediction of $y(i)$ that is based on the previous
observations. Likewise, we refer to $\hat{y}(i-m-1 \,|\, i-1 : i-m)$ as the backward prediction of
$y(i-m-1)$ that is based on the same observations, as indicated below:

$$y(i-m-1) \quad \boxed{\; y(i-m) \;\; \cdots \;\; y(i-2) \;\; y(i-1) \;}$$

In one case, we are using the boxed observations to predict the future (hence, the designation forwards
prediction), while in the second case we are using the same observations to "predict" the past
(and, hence, the designation backwards prediction).

Let $\{f_m(i), b_m(i-1)\}$ represent the corresponding residual errors, also known as forward and
backward estimation errors,

$$f_m(i) = y(i) - \hat{y}(i \,|\, i-1 : i-m)$$
$$b_m(i-1) = y(i-m-1) - \hat{y}(i-m-1 \,|\, i-1 : i-m)$$

and denote their variances and cross-correlation by

$$\zeta_m^f(i) = E|f_m(i)|^2, \qquad \zeta_m^b(i-1) = E|b_m(i-1)|^2, \qquad \delta_m(i) = E\, b_m(i-1) f_m^*(i)$$

(a) Show that the forward and backward errors, as well as their variances, satisfy the order-update
relations:

$$\begin{aligned}
\kappa_m^f(i) &= \delta_m^*(i) / \zeta_m^b(i-1) \\
\kappa_m^b(i) &= \delta_m(i) / \zeta_m^f(i) \\
f_{m+1}(i) &= f_m(i) - \kappa_m^f(i)\, b_m(i-1) \\
b_{m+1}(i) &= b_m(i-1) - \kappa_m^b(i)\, f_m(i) \\
\zeta_{m+1}^f(i) &= \zeta_m^f(i) - |\delta_m(i)|^2 / \zeta_m^b(i-1) \\
\zeta_{m+1}^b(i) &= \zeta_m^b(i-1) - |\delta_m(i)|^2 / \zeta_m^f(i)
\end{aligned}$$

in terms of the so-called reflection coefficients $\{\kappa_m^f(i), \kappa_m^b(i)\}$.

Remark. Later, in Part X (Lattice Filters), we study order-recursive least-squares (as opposed to
least-mean-squares) lattice filters. It is instructive to compare the above lattice recursions with the
recursions appearing in Tables 40.1 and 42.1.

(b) Define the normalized reflection coefficient $\kappa_m^c(i) = \delta_m^*(i) / \big(\zeta_m^{b/2}(i-1)\, \zeta_m^{f/2}(i)\big)$. Show that
$|\kappa_m^c(i)| \le 1$.

(c) Assume $|\kappa_m^c(i)| < 1$ for all $i$ and $m$, and define the normalized errors $\bar{b}_m(i) = b_m(i)/\zeta_m^{b/2}(i)$
and $\bar{f}_m(i) = f_m(i)/\zeta_m^{f/2}(i)$. Show that the normalized errors satisfy the recursions:

$$\bar{f}_{m+1}(i) = \frac{1}{\sqrt{1 - |\kappa_m^c(i)|^2}} \left[ \bar{f}_m(i) - \kappa_m^c(i)\, \bar{b}_m(i-1) \right]$$
$$\bar{b}_{m+1}(i) = \frac{1}{\sqrt{1 - |\kappa_m^c(i)|^2}} \left[ \bar{b}_m(i-1) - \kappa_m^{c*}(i)\, \bar{f}_m(i) \right]$$

in terms of the single reflection coefficient $\kappa_m^c(i)$.

(d) Assume now that $\{y(i)\}$ is a wide-sense stationary random process so that $E\, y(i) y^*(j) =
E\, y(i-k) y^*(j-k)$ for all $k$. Show that, in this case, the quantities $\{\zeta_m^f(i), \zeta_m^b(i), \kappa_m^c(i)\}$
become independent of $i$ and that the normalized errors $\{\bar{f}_m(i), \bar{b}_m(i)\}$ become related as
shown in Fig. II.5.

FIGURE II.5 A lattice section showing how the normalized forward and backward estimation errors for
stationary processes are related in terms of the normalized reflection coefficient.


Problem II.32 (Levinson-Durbin algorithm) Refer to Prob. II.31 and assume, as in part (d) of
that problem, that the process $\{y(i)\}$ is stationary. Express the forward residual error as

$$f_m(i) = y(i) + a_m(1) y(i-1) + a_m(2) y(i-2) + \ldots + a_m(m) y(i-m)$$

for some coefficients $\{a_m(j), j = 1, \ldots, m\}$. Let $c(k) = E\, y(i) y^*(i-k)$.

(a) Show that the backward residual error is given by a similar expression:

$$b_m(i-1) = y(i-m-1) + a_m^*(m) y(i-1) + a_m^*(m-1) y(i-2) + \ldots + a_m^*(1) y(i-m)$$

in terms of the conjugated coefficients $\{a_m^*(j), j = 1, \ldots, m\}$.

(b) Collect the $\{a_m(j)\}$ into a row vector as $a_m = [\,a_m(m) \;\; a_m(m-1) \;\; \ldots \;\; a_m(1) \;\; 1\,]$. Show
that $a_m$ satisfies the so-called Yule-Walker equations

$$a_m T_m = [\,0 \;\; \ldots \;\; 0 \;\; \zeta_m^f\,]$$

where $\zeta_m^f = E|f_m(i)|^2$ and $T_m$ is an $(m+1) \times (m+1)$ Toeplitz matrix whose first column
is $\mathrm{col}\{c(0), c(1), \ldots, c(m)\}$,

$$T_m \triangleq \begin{bmatrix} c(0) & c^*(1) & c^*(2) & \cdots & c^*(m) \\ c(1) & c(0) & c^*(1) & \cdots & c^*(m-1) \\ c(2) & c(1) & c(0) & \cdots & c^*(m-2) \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ c(m) & c(m-1) & c(m-2) & \cdots & c(0) \end{bmatrix}$$

Show further that $\zeta_m^b = \zeta_m^f$, where $\zeta_m^b = E|b_m(i)|^2$.

(c) Refer to the recursions of part (a) of Prob. II.31. Argue that the reflection coefficients are
now related as $\kappa_m^f = \kappa_m^{b*}$. Let $\kappa_m^f = \kappa_m$. Deduce the following so-called Levinson-Durbin
recursions for the prediction vectors $\{a_m\}$:

$$a_{m+1} = [\,0 \;\; a_m\,] - \kappa_m\, [\,a_m^{\#} \;\; 0\,]$$

where $a_m^{\#} \triangleq [\,1 \;\; a_m^*(1) \;\; \ldots \;\; a_m^*(m)\,]$.

(d) Show also that $\zeta_{m+1}^f = \zeta_m^f (1 - |\kappa_m|^2)$ and

$$\kappa_m = \frac{c(m+1) + \sum_{i=1}^{m} a_m(m+1-i)\, c(i)}{\zeta_m^f}$$

Remark. The Yule-Walker equations were introduced by Yule (1927) in his studies on fitting an autoregressive
model to sunspot data. The efficient algorithm of parts (c) and (d) for order-updating the solution of the Yule-
Walker equations was derived by Durbin (1960) and earlier (in a more general context) by Levinson (1947).
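
The order-update of parts (c) and (d) can be sketched as follows for the special case of a real-valued stationary process, where conjugates disappear. This is an illustrative implementation under that assumption, using the sign convention $f_{m+1}(i) = f_m(i) - \kappa_m b_m(i-1)$; the function name `levinson_durbin` and the array layout `a[j] = a_m(j)` are choices made here.

```python
import numpy as np

def levinson_durbin(c, order):
    """Order-recursive solution of the Yule-Walker equations (real-valued case).

    c[k] = E y(i) y(i-k) for k = 0..order. Returns (a, zeta, kappas), where
    a[j] = a_m(j) with a[0] = 1 and zeta is the final prediction-error variance.
    """
    a = np.array([1.0])
    zeta = float(c[0])
    kappas = []
    for m in range(order):
        # kappa_m = (c(m+1) + sum_{j=1}^{m} a_m(j) c(m+1-j)) / zeta_m
        kappa = (c[m + 1] + sum(a[j] * c[m + 1 - j] for j in range(1, m + 1))) / zeta
        # a_{m+1}(j) = a_m(j) - kappa_m a_m(m+1-j), with a_{m+1}(m+1) = -kappa_m
        a_ext = np.append(a, 0.0)
        a = a_ext - kappa * a_ext[::-1]
        zeta *= 1.0 - kappa ** 2
        kappas.append(float(kappa))
    return a, zeta, kappas
```

For an autocorrelation of the form $c(k) = \rho^{|k|}$, the recursion returns $a_m(1) = -\rho$ and zeros for the higher-order coefficients, with $\zeta^f = 1 - \rho^2$, as expected for a first-order autoregressive process.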


Problem II.33 (Block signal processing) Consider a 6-th order FIR filter with transfer function

$$K_o(z) = \alpha(0) + \alpha(1)z^{-1} + \alpha(2)z^{-2} + \alpha(3)z^{-3} + \alpha(4)z^{-4} + \alpha(5)z^{-5}$$

Its so-called 3rd-order polyphase components $\{E_k(z)\}$ are defined via the identity

$$K_o(z) = E_0(z^3) + z^{-1} E_1(z^3) + z^{-2} E_2(z^3)$$

where each $E_k(z)$ is a polynomial in $z^{-1}$ of degree one.

(a) Verify that $E_0(z) = \alpha(0) + \alpha(3)z^{-1}$, $E_1(z) = \alpha(1) + \alpha(4)z^{-1}$, and $E_2(z) = \alpha(2) +
\alpha(5)z^{-1}$.
(b) Let $\{x(n), y(n)\}$ denote the input and output sequences of the fullband filter $K_o(z)$. Define
the vectors of size 3 each,

$$x_n \triangleq \begin{bmatrix} x(3n) \\ x(3n-1) \\ x(3n-2) \end{bmatrix}, \qquad y_n \triangleq \begin{bmatrix} y(3n) \\ y(3n-1) \\ y(3n-2) \end{bmatrix}$$

Verify that the matrix transfer function that maps $x_n$ to $y_n$ is given by

$$\mathbf{K}_o(z) = \begin{bmatrix} E_0(z) & E_1(z) & E_2(z) \\ z^{-1}E_2(z) & E_0(z) & E_1(z) \\ z^{-1}E_1(z) & z^{-1}E_2(z) & E_0(z) \end{bmatrix}$$

Remark. The matrix transfer function $\mathbf{K}_o(z)$ allows us to evaluate the output signal on a block-by-block basis.
The terminology of fullband filters and polyphase components is common in multirate digital signal processing.
The transfer matrix $\mathbf{K}_o(z)$ shown above has a so-called pseudo-circulant structure. A pseudo-circulant matrix is
a circulant matrix with the exception that all entries below the main diagonal are further multiplied by the same
factor $z^{-1}$. A circulant matrix is a Toeplitz matrix (i.e., one with identical entries along the diagonals) with the
additional property that its first row is circularly shifted to the right in order to form the other rows. We shall
return to the result of this problem later in Chapter 26.
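
The block-by-block evaluation described in the remark can be compared with direct FIR filtering. The sketch below assumes zero initial conditions and an input that starts at time 0; `block_taps` and `block_filter` are hypothetical names, and in the array `K[d]` the index `d` counts block delays.

```python
import numpy as np

def block_taps(alpha):
    """Matrix taps K[0], K[1], K[2] with y_n = sum_d K[d] x_{n-d}."""
    a = np.asarray(alpha, dtype=float)           # alpha(0), ..., alpha(5)
    E = [a[0::3], a[1::3], a[2::3]]              # E_k(z): [alpha(k), alpha(k+3)]
    K = np.zeros((3, 3, 3))
    for p in range(3):
        for q in range(3):
            comp = E[(q - p) % 3]                # pseudo-circulant pattern
            delay = 1 if q < p else 0            # extra z^{-1} below the diagonal
            K[delay:delay + 2, p, q] = comp
    return K

def block_filter(alpha, x):
    """Filter x in blocks x_n = col{x(3n), x(3n-1), x(3n-2)}; returns y(0), y(1), ..."""
    x = np.asarray(x, dtype=float)
    get = lambda k: x[k] if 0 <= k < len(x) else 0.0
    blk = lambda n: np.array([get(3 * n), get(3 * n - 1), get(3 * n - 2)])
    K = block_taps(alpha)
    y = {}
    for n in range((len(x) - 1) // 3 + 1):
        yn = sum(K[d] @ blk(n - d) for d in range(3))
        for j in range(3):
            if 3 * n - j >= 0:
                y[3 * n - j] = yn[j]
    return np.array([y[t] for t in sorted(y)])
```

The output agrees sample by sample with `np.convolve(x, alpha)` over the times covered by complete blocks.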


Problem II.34 (Oblique projection) Consider zero-mean random variables $\{x, y, z, d\}$ and let
$R_{yd} = E\, yd^*$, $R_{zy} = E\, zy^*$, and $R_{xd} = E\, xd^*$. We want to determine estimators for $x$ and $z$ such
that:

1. $\hat{x} = K_x^o y$, for some coefficient matrix $K_x^o$ to be determined.

2. $\hat{z} = K_z^o d$, for some coefficient matrix $K_z^o$ to be determined.

3. The estimation error $\tilde{x} = x - \hat{x}$ is orthogonal to $d$, i.e., $E\, \tilde{x} d^* = 0$.

4. The estimation error $\tilde{z} = z - \hat{z}$ is orthogonal to $y$, i.e., $E\, \tilde{z} y^* = 0$.

We refer to $\hat{z}$ as the oblique projection of $z$ onto $d$ because the resulting error, $\tilde{z}$, is not orthogonal
to $d$ but rather to $y$. In other words, we are projecting $z$ onto $d$ along a direction that is orthogonal
to $y$. Likewise, we say that $\hat{x}$ is the oblique projection of $x$ onto $y$.

(a) Assuming $R_{yd}$ is invertible, show that $K_x^o = R_{xd} R_{yd}^{-1}$ and $K_z^o = R_{zy} R_{dy}^{-1}$, where $R_{dy} = R_{yd}^*$.

(b) Assume that $\{y, d, x, z\}$ are related via $y = Hx + v$ and $d = Az + w$, with $E\, vw^* = W$,
$E\, zx^* = \Pi$, $E\, vz^* = 0$ and $E\, wx^* = 0$. Determine $\hat{x}$ and $\hat{z}$ in terms of $\{H, A, \Pi, W, y, d\}$.


Remark. Oblique projections arise in the study of instrumental variable methods in system identification (see, 
e.g., Söderström and Stoica (1983)). They also arise in some communications and array processing applications
and in higher-order spectral (HOS) analysis methods (see, e.g., Kayalar and Weinert (1989) and Behrens and 
Scharf (1994). See also Prob. VII.23). A Kalman-type filter for performing oblique state-space estimation was 
developed by Sayed and Kailath (1995). 


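Problem II.35 (Valve pressure) Let $p_o$ denote the initial pressure in a valve and assume that it is
decreasing exponentially. Three noisy measurements of the pressure in the valve are available at time
instants $\{t_0, t_1, t_2\}$, say $y(i) = p_o e^{-t_i} + v(i)$ for $i = 0, 1, 2$, where the $\{v(i)\}$ are uncorrelated
zero-mean random variables with variances $\sigma_{v,i}^2$. Let $\hat{p}_o$ denote the minimum-variance-unbiased
estimator of $p_o$ given $\{y(0), y(1), y(2)\}$. Show that

$$\hat{p}_o = \left( \frac{e^{-2t_0}}{\sigma_{v,0}^2} + \frac{e^{-2t_1}}{\sigma_{v,1}^2} + \frac{e^{-2t_2}}{\sigma_{v,2}^2} \right)^{-1} \left( \frac{e^{-t_0}}{\sigma_{v,0}^2}\, y(0) + \frac{e^{-t_1}}{\sigma_{v,1}^2}\, y(1) + \frac{e^{-t_2}}{\sigma_{v,2}^2}\, y(2) \right)$$

and that the resulting m.m.s.e. is

$$\text{m.m.s.e.} = \left( \frac{e^{-2t_0}}{\sigma_{v,0}^2} + \frac{e^{-2t_1}}{\sigma_{v,1}^2} + \frac{e^{-2t_2}}{\sigma_{v,2}^2} \right)^{-1}$$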

Problem II.36 (Constrained mean-square error) Let $d$ denote a scalar zero-mean random variable
with variance $\sigma_d^2$, and let $u$ denote a $1 \times M$ zero-mean random vector with covariance matrix
$R_u = E\, u^* u > 0$. Consider the constrained optimization problem

$$\min_w \; E|d - uw|^2 \quad \text{subject to} \quad c^* w = \alpha$$

where $c$ is a known $M \times 1$ vector and $\alpha$ is a known real scalar.

(a) Let $z = w - R_u^{-1} R_{du}$, and $R_{du} = E\, du^*$. Show that the above optimization problem is
equivalent to the following:

$$\min_z \left[ (\sigma_d^2 - R_{ud} R_u^{-1} R_{du}) + z^* R_u z \right] \quad \text{subject to} \quad c^* z = \alpha - c^* R_u^{-1} R_{du}$$

(b) Use Remark 6.1 to conclude that the desired optimal solution, $w^o$, of the constrained optimization
problem is given by

$$w^o = R_u^{-1} R_{du} + R_u^{-1} c \, \frac{\alpha - c^* R_u^{-1} R_{du}}{c^* R_u^{-1} c}$$

Verify that this solution satisfies the constraint $c^* w^o = \alpha$.

(c) Alternatively, arrive at the same expression for $w^o$ by using a Lagrange multiplier argument.
Specifically, introduce a complex Lagrange multiplier $\lambda$ and consider the extended cost function
$J_e(w, \lambda) = E|d - uw|^2 + 2\,\mathrm{Re}[\lambda(c^* w - \alpha)]$, in terms of the real part of $\lambda(c^* w - \alpha)$.
Set the individual gradients of $J_e(w, \lambda)$ with respect to $w$ and with respect to $\lambda$ equal to zero
and determine $w^o$.
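
A quick numerical check of a candidate solution to Prob. II.36 is to confirm that it meets the constraint and that no other feasible point has a lower cost. The sketch below assumes real-valued quantities and uses the form $w^o = R_u^{-1}R_{du} + R_u^{-1}c\,(\alpha - c^* R_u^{-1} R_{du})/(c^* R_u^{-1} c)$, which follows from the change of variables in part (a); the function names are hypothetical.

```python
import numpy as np

def constrained_solution(R_u, R_du, c, alpha):
    """Minimizer of E|d - u w|^2 subject to c^T w = alpha (real-valued case)."""
    w_unc = np.linalg.solve(R_u, R_du)      # unconstrained solution R_u^{-1} R_du
    g = np.linalg.solve(R_u, c)             # R_u^{-1} c
    return w_unc + g * (alpha - c @ w_unc) / (c @ g)

def cost(w, R_u, R_du, sigma_d2):
    """E|d - u w|^2 = sigma_d2 - 2 R_du^T w + w^T R_u w for real data."""
    return float(sigma_d2 - 2.0 * R_du @ w + w @ R_u @ w)
```

Any feasible competitor can be written as $w^o + \Delta$ with $c^*\Delta = 0$, so the comparison only needs perturbations orthogonal to $c$.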


Problem II.37 (Eigenfilters) Let $y = s + v$ be a vector of measurements, where $v$ is noise and
$s$ is the desired signal. Both $v$ and $s$ are zero-mean uncorrelated random vectors with covariance
matrices $\{R_v, R_s\}$, respectively. We wish to determine a unit-norm column vector, $w$, such that the
signal-to-noise ratio in the output signal, $w^* y$, is maximized.

(a) Verify that the covariance matrices of the signal and noise components in $w^* y$ are equal to
$w^* R_s w$ and $w^* R_v w$, respectively.

(b) Use the Rayleigh-Ritz characterization of eigenvalues from Sec. B.1 to conclude that the
solution of $\max_{\|w\|=1} w^* R_s w$ is given by the unit-norm eigenvector that corresponds to the
maximum eigenvalue of $R_s$, written as $w^o = q_{\max}$, where $R_s q_{\max} = \lambda_{\max} q_{\max}$. Assume
that $R_v = \sigma_v^2 I$. Verify further that the resulting maximum SNR is equal to $\lambda_{\max}/\sigma_v^2$.

(c) Assume now that $v$ is colored noise and introduce the Cholesky factorization $R_v = LL^*$
(the Cholesky factorization of a positive-definite matrix is described in Sec. B.3). Repeat the
argument of part (b) to show that the solution of

$$\max_{\|w\|=1} \; \frac{w^* R_s w}{w^* R_v w}$$

is now related to the unit-norm eigenvector that corresponds to the maximum eigenvalue of
$L^{-1} R_s L^{-*}$.
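
The colored-noise case of part (c) can be sketched with a Cholesky factor and a Hermitian eigendecomposition. The construction $w \propto L^{-*} q_{\max}$ used below is one way of relating the solution to the eigenvector of $L^{-1} R_s L^{-*}$ and is stated here as an assumption to be verified in the problem; the function names are hypothetical.

```python
import numpy as np

def eigenfilter(R_s, R_v):
    """Unit-norm w maximizing (w^* R_s w)/(w^* R_v w), and the maximum ratio."""
    L = np.linalg.cholesky(R_v)                  # R_v = L L^*
    L_inv = np.linalg.inv(L)
    C = L_inv @ R_s @ L_inv.conj().T             # L^{-1} R_s L^{-*}
    eigvals, eigvecs = np.linalg.eigh(C)         # eigenvalues in ascending order
    q = eigvecs[:, -1]                           # eigenvector of the largest one
    w = np.linalg.solve(L.conj().T, q)           # w proportional to L^{-*} q
    return w / np.linalg.norm(w), float(eigvals[-1])

def snr(w, R_s, R_v):
    """Output SNR of the combiner w^* y."""
    return float(np.real(w.conj() @ R_s @ w) / np.real(w.conj() @ R_v @ w))
```

When `R_v` is a multiple of the identity, the result reduces to the dominant eigenvector of `R_s`, as in part (b).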


Problem II.38 (Space-time coding) Consider the same setting as in Prob. II.24, except that now
we are interested in estimating the channel gains $\{\alpha, \beta\}$, which are assumed to be unknown constants.
For this purpose, known symbols $\{s_1(i), s_2(i)\}$ are transmitted at time $i$ from the two antennas
followed by the symbols $\{-s_2^*(i), s_1^*(i)\}$ at time $i + 1$. Assume the symbols have unit magnitude,
i.e., $|s_k(i)|^2 = 1$ for $k = 1, 2$.

(a) Verify that the received signals are given by

$$\begin{bmatrix} r(i) \\ r(i+1) \end{bmatrix} = \underbrace{\begin{bmatrix} s_1(i) & s_2(i) \\ -s_2^*(i) & s_1^*(i) \end{bmatrix}}_{S} \begin{bmatrix} \alpha \\ \beta \end{bmatrix} + \begin{bmatrix} v(i) \\ v(i+1) \end{bmatrix}$$

where the matrix $S$ satisfies $S^* S = (|s_1(i)|^2 + |s_2(i)|^2) I = 2I$.

(b) Find the minimum-variance-unbiased estimator of $\{\alpha, \beta\}$ given $\{r(i), r(i+1)\}$. Find also
the resulting minimum mean-square error.


Problem II.39 (Constrained optimization) Consider the more general optimization problem

$$\min_K \; K R_v K^* \quad \text{subject to} \quad KH = A \quad \text{and} \quad R_v > 0$$

where $K$ is $m \times N$, $H$ is $N \times n$, $A$ is $m \times n$, $n < N$, $m < N$, and $H$ has full rank. In the
text we assumed $A$ is square and equal to the identity matrix (see (6.14)). Show that the optimal
solution is given by $K^o = A (H^* R_v^{-1} H)^{-1} H^* R_v^{-1}$ and that the resulting minimum cost is equal to
$A (H^* R_v^{-1} H)^{-1} A^*$.


Problem II.40 (Comparing linear and decision-feedback equalization) Refer to Sec. 6.4
and set $B(z) = 0$. The resulting structure would become a linear equalizer. The purpose of this
problem is to show that the m.m.s.e. when $B(z) = 0$ is larger than or equal to the m.m.s.e. for
nontrivial feedback filters.

(a) When $B(z) = 0$, and using (6.27), argue that the m.m.s.e. of the resulting linear equalizer is
equal to $e_0^* R_\delta e_0$ where $e_0 = \mathrm{col}\{1, 0, 0, \ldots, 0\}$. That is, the m.m.s.e. is equal to the (0,0)
entry of $R_\delta$.

(b) We know from (6.29) that the m.m.s.e. of the DFE design is given by $1/(e_0^* R_\delta^{-1} e_0)$. We
would like to show that $1/(e_0^* R_\delta^{-1} e_0) \le e_0^* R_\delta e_0$. Let us establish a more general result. Let $A$
denote any positive-definite Hermitian matrix and partition it as

$$A = \begin{bmatrix} a & z^* \\ z & B \end{bmatrix}$$

where $a$ is a positive scalar, $z$ is a column vector and $B$ is also Hermitian positive-definite, by
virtue of the positive-definiteness of $A$ (see Sec. B.3). Show that the (0,0) entry of $A^{-1}$ is
positive and given by $[A^{-1}]_{00} = (a - z^* B^{-1} z)^{-1}$. Conclude that $[A]_{00} \ge 1/[A^{-1}]_{00}$.


Problem II.41 (DFE as linear estimation) Refer to the discussion at the end of Sec. 6.4. We
wish to show that expressions (6.36) and (6.37) lead to the same solutions (6.29)-(6.30). 


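(a) Partition the vector $r$ into $r = \mathrm{col}\{\bar{s}, y\}$, with $\bar{s}$ denoting the top entries of $r$ that are
dependent on $s(\cdot)$, and $y$ is as in (6.23). Using (6.23), verify that

$$R_r \triangleq E\,rr^* = \begin{bmatrix} R_{\bar{s}} & R_{\bar{s}y} \\ R_{y\bar{s}} & R_y \end{bmatrix}$$

where $R_{\bar{s}} = E\,\bar{s}\bar{s}^*$ and $R_{\bar{s}y} = E\,\bar{s}y^*$. Check also that

$$R_{sr} = E\,s(i-\Delta)r^* = \begin{bmatrix} E\,s(i-\Delta)\bar{s}^* & E\,s(i-\Delta)y^* \end{bmatrix}$$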


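(b) Let $R_{\bar\delta} = R_{\bar{s}} - R_{\bar{s}y} R_y^{-1} R_{y\bar{s}}$. Verify that

$$R_r^{-1} = \begin{bmatrix} I & 0 \\ -R_y^{-1} R_{y\bar{s}} & I \end{bmatrix} \begin{bmatrix} R_{\bar\delta}^{-1} & 0 \\ 0 & R_y^{-1} \end{bmatrix} \begin{bmatrix} I & -R_{\bar{s}y} R_y^{-1} \\ 0 & I \end{bmatrix}$$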


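(c) Substitute the expressions for $R_{sr}$ and $R_r^{-1}$ into (6.36) and show that $k^*$ evaluates to
$k^* = [\,-\beta^* \ \ \alpha^*\,]$, where the row vectors $\{\alpha^*, \beta^*\}$ are given by

$$\beta^* = -\left[ R_{s(i-\Delta)\bar{s}} - R_{s(i-\Delta)y} R_y^{-1} R_{y\bar{s}} \right] R_{\bar\delta}^{-1}, \qquad \alpha^* = \beta^* R_{\bar{s}y} R_y^{-1} + R_{s(i-\Delta)y} R_y^{-1}$$

where $R_{s(i-\Delta)\bar{s}} = E\,s(i-\Delta)\bar{s}^*$ and $R_{s(i-\Delta)y} = E\,s(i-\Delta)y^*$.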


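(d) Verify that the matrix $R_\delta$ in (6.24) is given by

$$R_\delta = \begin{bmatrix} \sigma_s^2 - R_{s(i-\Delta)y} R_y^{-1} R_{ys(i-\Delta)} & R_{s(i-\Delta)\bar{s}} - R_{s(i-\Delta)y} R_y^{-1} R_{y\bar{s}} \\ R_{\bar{s}s(i-\Delta)} - R_{\bar{s}y} R_y^{-1} R_{ys(i-\Delta)} & R_{\bar\delta} \end{bmatrix}$$

Evaluate the first row of $R_\delta^{-1}$ and show that

$$e_0^* R_\delta^{-1} = (e_0^* R_\delta^{-1} e_0)\,[\,1 \ \ \beta^*\,]$$

Conclude that $b_{\rm opt}^* = [\,1 \ \ \beta^*\,]$ and $f_{\rm opt}^* = \alpha^*$.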


(e) Show that expression (6.37) for the m.m.s.e. coincides with (6.29). 


Problem II.42 (DFE for a multi-transmit multi-receive antenna system) Refer to Prob. II.29.
We reconsider the two-transmit two-receive antenna system of that problem and proceed to design a
decision-feedback equalizer for it, as opposed to a linear equalizer. The arguments extend the
derivation of Sec. 6.4 to the MIMO case. Although we focus on a 2 x 2 channel model, the results can be
extended to multi-antenna systems in a straightforward manner.

At each time instant $i$, define the vectors of received samples, transmitted symbols, and delayed
decisions,

$$y_i \triangleq \begin{bmatrix} y_1(i) \\ y_2(i) \end{bmatrix}, \qquad s_i \triangleq \begin{bmatrix} s_1(i) \\ s_2(i) \end{bmatrix}, \qquad \check{s}_{i-\Delta} \triangleq \begin{bmatrix} \check{s}_1(i-\Delta) \\ \check{s}_2(i-\Delta) \end{bmatrix}$$

The measurement vectors $\{y_i\}$ are fed into a feedforward FIR filter with 3 matrix taps (i.e., $L = 3$)
of dimensions 2 x 2 each. Likewise, the delayed decisions are fed into a feedback FIR filter with 3
matrix taps, including $B_0$, of dimensions 2 x 2 each, so that $Q = 2$ (in Sec. 6.4 we defined $Q$ as the
number of taps excluding the direct path coefficient $B_0$). The input of the decision device is given
by

$$\hat{s}_{i-\Delta} = F_0 y_i + F_1 y_{i-1} + F_2 y_{i-2} - B_0 \check{s}_{i-\Delta} - B_1 \check{s}_{i-\Delta-1} - B_2 \check{s}_{i-\Delta-2}$$

where the $\{F_k\}$ and the $\{-B_k\}$ denote the matrix coefficients of the feedforward and feedback
filters. It is assumed in the sequel that the decisions that are fed back are correct so that
$\check{s}_{i-\Delta-k} = s_{i-\Delta-k}$ for $k = 0, 1, 2$. Our objective is to select the $\{F_k, B_k\}$ in order to minimize the
covariance matrix of the estimation error, namely,

$$\min_{\{F_k, B_k\}} \; E\,\tilde{s}_{i-\Delta}\tilde{s}_{i-\Delta}^*$$


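(a) Collect the signals within the feedforward and feedback filters into the following vectors:

$$y_{i:i-2} \triangleq \begin{bmatrix} y_i \\ y_{i-1} \\ y_{i-2} \end{bmatrix}, \qquad s_\Delta \triangleq \begin{bmatrix} s_{i-\Delta} \\ s_{i-\Delta-1} \\ s_{i-\Delta-2} \end{bmatrix}$$

and define the corresponding covariance matrix, assumed positive-definite,

$$R \triangleq E \begin{bmatrix} s_\Delta \\ y_{i:i-2} \end{bmatrix} \begin{bmatrix} s_\Delta \\ y_{i:i-2} \end{bmatrix}^* = \begin{bmatrix} R_s & R_{sy} \\ R_{ys} & R_y \end{bmatrix}$$

Define the filter matrices $F = [\,F_0 \ \ F_1 \ \ F_2\,]$ and $B = [\,I_2 + B_0 \ \ B_1 \ \ B_2\,]$. Verify
that $\tilde{s}_{i-\Delta} = B s_\Delta - F y_{i:i-2}$ and show that the optimal coefficient matrices $\{F, B\}$ are
related via

$$F_{\rm opt} = B_{\rm opt} R_{sy} R_y^{-1}$$

(b) Consider the situation in which only previous decisions from users 1 and 2 are available for
feedback. That is, set $B_0 = 0$. Show that the optimal choice of $B$ is obtained by solving

$$\min_{B} \; B R_\delta B^* \quad \text{subject to} \quad B\Psi = I_2, \qquad \Psi \triangleq \begin{bmatrix} I_2 \\ 0 \end{bmatrix}$$

where $R_\delta = R_s - R_{sy} R_y^{-1} R_{ys}$. Use the result of Prob. II.39 to conclude that
$B_{\rm opt} = (\Psi^* R_\delta^{-1} \Psi)^{-1} \Psi^* R_\delta^{-1}$, and that the resulting minimum mean-square error is
m.m.s.e. $= (\Psi^* R_\delta^{-1} \Psi)^{-1}$.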

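(c) Introduce the vectors

$$s_{i:i-3} \triangleq \begin{bmatrix} s_i \\ s_{i-1} \\ s_{i-2} \\ s_{i-3} \end{bmatrix}, \qquad v_{i:i-2} \triangleq \begin{bmatrix} v_i \\ v_{i-1} \\ v_{i-2} \end{bmatrix}, \qquad v_i \triangleq \begin{bmatrix} v_1(i) \\ v_2(i) \end{bmatrix}$$

Verify that $y_{i:i-2} = H s_{i:i-3} + v_{i:i-2}$, where the channel matrix $H$ has a block Toeplitz
structure and is given by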


HA LP 


for some 2 x 2 matrix P. 
(d) Let $R_v = E\,v_{i:i-2} v_{i:i-2}^*$ and $R_s = E\,s_{i:i-3} s_{i:i-3}^*$. Show that

$$R_y = H R_s H^* + R_v \qquad \text{and} \qquad R_{sy} = \left(E\,s_\Delta s_{i:i-3}^*\right) H^*$$

If the maximum number of taps among the channels $\{h_{11}, h_{12}, h_{21}, h_{22}\}$ is denoted by $M$,
show that $\Delta$ should satisfy

$$\Delta + Q \le L + M - 2$$

in order for the term $E\,s_\Delta s_{i:i-3}^*$ to have the form $E\,s_\Delta s_{i:i-3}^* = [\,0 \ \ R_s \ \ 0\,]$. Verify that
the leading zero block has dimensions $2(Q+1) \times 2\Delta$, and that $E\,s_\Delta s_{i:i-3}^*$ is
$2(Q+1) \times 2(L+M-1)$.

(e) Assume $R_s = \sigma_s^2 I$, which arises when the sequences $\{s_1(\cdot), s_2(\cdot)\}$ have the same variances.
Verify that

$$R_\delta = \Phi \left( \frac{1}{\sigma_s^2} I + H^* R_v^{-1} H \right)^{-1} \Phi^*$$

where $\Phi = [\,0 \ \ I \ \ 0\,]$. Identify the sizes of the blocks in $\Phi$.


Remark. The results of Probs. II.42-II.45 on MIMO DFE are based on the work by Al-Dhahir and Sayed (2000).


Problem II.43 (Selection of the delay) Consider the same scenario as in Prob. II.42 but assume
that $\Delta + Q = L + M - 2$.


(a) Show that $R_\delta = \Phi \left( R_s^{-1} + H^* R_v^{-1} H \right)^{-1} \Phi^*$, where $\Phi = [\,0_{2(Q+1) \times 2\Delta} \ \ I_{2(Q+1) \times 2(Q+1)}\,]$.


(b) Let $X = R_s^{-1} + H^* R_v^{-1} H$; it does not depend on $\Delta$ while $\Phi$ does. This observation
suggests a way for selecting the values of $\{\Delta, Q\}$. Use the result of part (b) of Prob. II.42
to verify that the m.m.s.e. is given by m.m.s.e. $= \left( \Psi^* (\Phi X^{-1} \Phi^*)^{-1} \Psi \right)^{-1}$, where
$\Psi = \mathrm{col}\{I_2, 0\}$. Introduce the triangular factorization $X = LDL^*$, where $L$ is lower-triangular
with unit diagonal entries and $D$ is diagonal with entries $\{d_i\}$ (cf. Sec. B.3). Express the
m.m.s.e. in terms of the $\{d_i\}$. Which choice of $\{Q, \Delta\}$ minimizes the trace of the m.m.s.e.?


Problem II.44 (Ordered decisions) Consider the same scenario as in Prob. II.42, but now assume
that only decisions from user 1 (current and past) are available for use by user 2, and that only
past decisions of user 1 are available for his use. In other words, only decisions from the
lower-indexed user are available for use by the higher-indexed user. This case corresponds to restricting
the coefficient $B_0$ to being strictly lower-triangular, so that the determination of $B_{\rm opt}$ reduces to a
problem of the form

$$\min_{\{B_0, B_1, B_2\}} \; B R_\delta B^* \quad \text{subject to} \quad B\Psi = I + B_0 \ \text{ and } \ B_0 \ \text{strictly lower triangular}$$

(a) Let $C = I + B_0$. Use the result of Prob. II.39 to argue that the solution to the above problem
is given by $B_{\rm opt} = C_{\rm opt} (\Psi^* R_\delta^{-1} \Psi)^{-1} \Psi^* R_\delta^{-1}$, where $C_{\rm opt}$ is obtained by solving the
optimization problem

$$\min_{C} \; C (\Psi^* R_\delta^{-1} \Psi)^{-1} C^*$$

subject to $C$ being a lower triangular matrix with unit diagonal entries.


(b) Let $X = \Psi^* R_\delta^{-1} \Psi$. That is, $X$ is the leading block of $R_\delta^{-1}$. Introduce the triangular
factorization of $X$, say $X = U D U^*$, where $U$ is upper-triangular with unit entries (cf. Sec. B.3).
Show that $C_{\rm opt} = U^*$. Conclude that the resulting m.m.s.e. is equal to $D^{-1}$.


(c) Show that the trace of this value for the m.m.s.e. is lower than the trace of the m.m.s.e. found
in part (b) of Prob. II.42 (i.e., show that $\mathrm{Tr}(D^{-1}) < \mathrm{Tr}(X^{-1})$).


Problem II.45 (Multistage detection) Consider the same scenario as in Prob. II.42, but assume
now that current and past decisions from user 1 are available for use by user 2 and, likewise, current
and past decisions from user 2 are available for use by user 1. This corresponds to a situation in
which current decisions for all users are obtained from some previous detection stage. This case
restricts the coefficient $B_0$ to having zero diagonal elements.


(a) Follow the arguments of Prob. II.44 to show that

$$B_{\rm opt} = C_{\rm opt} (\Psi^* R_\delta^{-1} \Psi)^{-1} \Psi^* R_\delta^{-1}$$

where $C_{\rm opt} = I + B_{0,\rm opt}$ is obtained by solving the optimization problem

$$\min_{C} \; C X^{-1} C^* \quad \text{subject to} \quad e_i^* C e_i = 1 \ \text{ for } i = 0, 1$$

where $X = \Psi^* R_\delta^{-1} \Psi$, $e_0 = \mathrm{col}\{1, 0\}$, and $e_1 = \mathrm{col}\{0, 1\}$.


(b) Show that the two rows of Copt are given by 


(c) Find the resulting m.m.s.e. Show that its trace is smaller than the traces of the m.m.s.e. 
values of Probs. II.42 and II.44. 


Problem II.46 (Moving average process) Let $y_k = v_k + v_{k-1}$, $k \ge 0$, where $\{v_j, j \ge -1\}$
is a zero-mean stationary white-noise scalar process with unit variance. Show that

$$\hat{y}_{k+1|k} = \frac{k+1}{k+2}\left( y_k - \hat{y}_{k|k-1} \right)$$


Remark. For Probs. II.46-II.48, refer to the material in Chapter 7.


Problem II.47 (Correlated signal and noise processes) Consider the model $y_i = z_i + v_i$,
with $E\,v_i v_j^* = R_i \delta_{ij}$, $E\,v_i z_j^* = 0$ for $i > j$, and $E\,v_i z_i^* = D_i$. All random variables are zero-mean.
Let $R_{e,i}$ denote the covariance matrix of the innovations $\{e_i\}$ of the process $\{y_i\}$. Let also $\hat{z}_{i|i}$ denote
the l.l.m.s.e. of $z_i$ given $\{y_j, 0 \le j \le i\}$. Show that $\hat{z}_{i|i} = y_i - (D_i + R_i) R_{e,i}^{-1} e_i$.


Problem II.48 (Filtered residuals) Consider a process $y_i = H_i x_i + v_i$, where $\{v_i\}$ is a
white-noise zero-mean process with covariance matrix $R_i$ and uncorrelated with the zero-mean process
$\{x_i\}$. The filtered residuals of $\{y_i\}$ are defined as $\nu_i = y_i - H_i \hat{x}_{i|i}$. Show that the $\{\nu_i\}$ form a
white-noise sequence with covariance matrix $R_{\nu,i} = R_i - H_i P_{i|i} H_i^*$.


Problem II.49 (State-space estimation) Consider the state-space model
$x_{i+1} = F x_i + G_1 u_i + G_2 u_{i+1}$, $y_i = H x_i + v_i$ for $i \ge 0$, with zero-mean uncorrelated random
variables $\{x_0, u_i, v_i\}$ such that $E\,u_i u_j^* = Q_i \delta_{ij}$, $E\,v_i v_j^* = R_i \delta_{ij}$, $E\,x_0 x_0^* = \Pi_0$. Find recursive
equations for $\hat{x}_{i|i-1}$ and $P_{i|i-1}$.


COMPUTER PROJECTS 


Project II.1 (Linear equalization and decision devices) Consider the three-tap linear equalizer
discussed in Ex. 4.1, and studied further in Prob. II.17 where the optimal equalizer coefficients
were determined for several values of $\Delta$. The equalizer structure is shown again in Fig. II.6. Symbols
$\{s(i)\}$ are transmitted through an FIR channel and corrupted by additive white noise $\{v(i)\}$. The
received signal $\{y(i)\}$ is processed by a linear equalizer to generate estimators $\{\hat{s}(i-\Delta)\}$, which
are further fed into a decision device, as explained in parts (c) and (d) below. The purpose of this
device is to map each $\hat{s}(i-\Delta)$ to the closest symbol in the constellation; the result is denoted by
$\check{s}(i-\Delta)$. Choose initially $\sigma_v^2 = 0.004$ so that, according to part (f) of Prob. II.17, the signal-to-noise
ratio at the input of the equalizer is approximately 25 dB (the dB value is obtained by computing
$10\log(\cdot)$). The purpose of this project is to examine the performance and operation of this three-tap
equalizer.


FIGURE II.6 Linear equalization of an FIR channel.


(a) Write a program that evaluates the optimal equalizer coefficients for $\Delta = 0, 1, 2, 3$ and for
arbitrary noise variance $\sigma_v^2$. Use the program to feed 2000 BPSK symbols $\{s(i)\}$ into the
channel $1 + 0.5z^{-1}$ and generate the corresponding equalizer outputs $\{\hat{s}(i-\Delta)\}$, for
$\Delta = 0, 1, 2, 3$. Plot the scatter diagrams of $\{s(i), y(i), \hat{s}(i-\Delta)\}$. For each $\Delta$, estimate the
m.m.s.e. by computing

$$\frac{1}{N-\Delta} \sum_{i=\Delta+1}^{N} \left| s(i-\Delta) - \hat{s}(i-\Delta) \right|^2$$

Compare the resulting values with those obtained from theory (cf. Prob. II.17). Plot also the
scatter diagrams of $\{y(i), \hat{s}(i-\Delta)\}$ for $\Delta = 0$ when $\sigma_v^2 = 0.05$, which corresponds to
SNR = 14 dB. Compare with the scatter diagrams at 25 dB.


(b) How would the equalizer coefficients $\{\alpha(0), \alpha(1), \alpha(2)\}$ change if the input signal $\{s(i)\}$
were instead chosen uniformly from a QPSK constellation, i.e., $s(i) \in \{\sqrt{2}(\pm 1 \pm j)/2\}$?
Repeat the simulations of part (a) for QPSK data with $\sigma_v^2 = 0.004$ (and also $\sigma_v^2 = 0.05$).
Now, however, the white noise sequence $\{v(i)\}$ needs to be complex-valued. In order to
generate such a sequence with variance $\sigma_v^2$, simply generate two separate real-valued white
noise sequences $\{a(i), b(i)\}$ with variance $\sigma_v^2/2$ each. Then set $v(i) = a(i) + jb(i)$, where
$j = \sqrt{-1}$.


(c) For all cases in part (a), assume now that the output of the equalizer is applied to the nonlinear
decision device:

$$\mathrm{sign}[\hat{s}(i-\Delta)] = \begin{cases} +1 & \text{if } \hat{s}(i-\Delta) \ge 0 \\ -1 & \text{if } \hat{s}(i-\Delta) < 0 \end{cases}$$

Determine the number of erroneous decisions for each $\Delta$. Which $\Delta$ results in the smallest
number of errors?


(d) For all cases in part (b), assume that the output of the equalizer is applied to the nonlinear
decision device:

$$\mathrm{dec}[\hat{s}(i-\Delta)] = \frac{\sqrt{2}}{2} \left\{ \mathrm{sign}[\mathrm{Re}(\hat{s}(i-\Delta))] + j\,\mathrm{sign}[\mathrm{Im}(\hat{s}(i-\Delta))] \right\}$$

which assigns $\hat{s}(i-\Delta)$ to the closest symbol in the QPSK constellation. Determine the
number of erroneous decisions for each $\Delta$. Which $\Delta$ results in the smallest number of errors?


(e) Fix $\Delta = 1$ and vary the value of $\sigma_v^2$ between 0.004 and 0.2 in increments of 0.001, so that the
SNR is varied between 8 and 25 dB. Write a program that generates a plot showing how the
symbol error rate (SER) varies with SNR. [The SER is defined as the number of erroneous
symbols relative to the total number of transmissions.]


(f) Repeat part (e) for QPSK data, by repeating the simulations of parts (b) and (d).


(g) In order to visualize the improvement that is provided by the presence of the linear equalizer,
assume that the received signal $\{y(i)\}$ is applied directly to the nonlinear decision device
(i.e., let us remove the equalizer). The output of the decision device is then taken as $\check{s}(i-\Delta)$.
For both cases of BPSK and QPSK data, generate plots that show how the symbol error rate
varies with the SNR. Compare these plots with the ones obtained in parts (e) and (f) for small
and large values of $\sigma_v^2$.
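The two decision devices of parts (c) and (d), and the error count they feed, can be sketched as follows. This is only a minimal illustration in Python: the function names (`dec_bpsk`, `dec_qpsk`, `count_errors`) and the toy data are assumptions made here and do not come from the project statement, and the BPSK device is assumed to act on the real part of its input.

```python
import math

def sign(x):
    """Return +1.0 if x >= 0 and -1.0 otherwise."""
    return 1.0 if x >= 0 else -1.0

def dec_bpsk(s_hat):
    """BPSK decision: sign of the (real part of the) equalizer output."""
    return sign(complex(s_hat).real)

def dec_qpsk(s_hat):
    """QPSK decision: (sqrt(2)/2) * (sign[Re] + j sign[Im])."""
    z = complex(s_hat)
    return (math.sqrt(2) / 2) * complex(sign(z.real), sign(z.imag))

def count_errors(decisions, symbols, tol=1e-9):
    """Number of positions where a decision differs from the transmitted symbol."""
    return sum(1 for d, s in zip(decisions, symbols) if abs(d - s) > tol)

# Toy data (made up): three transmitted QPSK symbols and noisy equalizer outputs.
a = math.sqrt(2) / 2
symbols = [complex(a, a), complex(-a, a), complex(a, -a)]
outputs = [0.9 + 0.4j, -0.2 + 1.1j, -0.1 - 0.8j]  # the last one lands in the wrong quadrant
decisions = [dec_qpsk(x) for x in outputs]
errors = count_errors(decisions, symbols)
ser = errors / len(symbols)  # symbol error rate for this toy record
```

In a full simulation the `outputs` list would be replaced by the equalizer outputs for a given delay and noise variance, and the SER would be recorded for each SNR value.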


Project II.2 (Beamforming) Refer to the discussion in Sec. 6.5 on antenna beamforming, and
consider a uniform linear antenna array consisting of 4 elements spaced by $d = \lambda/5$.


(a) Assume first that the noise signal at each antenna is white with real and imaginary parts that
are Gaussian with variances 0.1 each (hence, $\sigma_v^2 = 0.2$). Assume also that the noise signals
across the antennas are uncorrelated with each other (i.e., assume spatial whiteness in addition
to temporal whiteness). Design an optimal beamformer with unit response along the direction
30° and find the theoretical m.m.s.e. Simulate the operation of the beamformer by using as
baseband signal the sinusoid $s(t) = \cos(2\pi t)$. Plot portions of the baseband signal $s(\cdot)$ and
the signals received at the antennas (sampled at the rate of 100 samples per second). Plot
also $s(\cdot)$ along with the real part of the output of the beamformer. Estimate the m.m.s.e. and
compare it with the theoretical value.


(b) Assume now that the input signal is the sum of two sinusoids arriving along different directions;
one impinges on the antennas at 30°, while the other arrives at 45°. Simulate again
the operation of the beamformer by using as input signal $s(t) = \cos(2\pi t) + 0.2\sin(4\pi t)$,
where the latter sinusoid is the one arriving at 45°. Assume that the sampling rate is still 100
samples per second. Estimate again the m.m.s.e. and compare it with the theoretical value.
Can you explain the difference?


(c) Assume now that the noise signals across the antennas are spatially correlated with an
exponential auto-correlation function, with $(i,j)$ entry equal to $0.85^{|i-j|}$, so that

$$R_v = \begin{bmatrix} 1.0000 & 0.8500 & 0.7225 & 0.6141 \\ 0.8500 & 1.0000 & 0.8500 & 0.7225 \\ 0.7225 & 0.8500 & 1.0000 & 0.8500 \\ 0.6141 & 0.7225 & 0.8500 & 1.0000 \end{bmatrix}$$

In order to generate a noise vector $v$ with this spatial auto-correlation function, we proceed as
follows. Use the command chol of MATLAB to determine the 4 x 4 lower triangular matrix $L$
such that $LL^* = R_v$ (every positive-definite matrix $R_v$ admits a unique factorization of this
form; see Sec. B.3). Then generate a complex-valued $M \times 1$ vector $e$ with unit-diagonal
covariance matrix by using, say, the command randn, and let $v = Le$. Show that the vector
$v$ so generated has covariance matrix $R_v$. Now repeat the simulations of parts (a) and (b) for
this case.

(d) Consider again the case of spatially white noise, as in part (a), and assume that the beamformer
has been designed optimally for 30°. Now let $s(t) = 1$ be the input signal and perform
a simulation that varies the direction of arrival of $s(t)$ between 1° and 180° in increments
of 1°. For each direction $\theta$, determine the output of the beamformer, $\hat{s}(t)$, and estimate its
average power,

$$P(\theta) = \frac{1}{N} \sum_{i=0}^{N-1} |\hat{s}(i)|^2$$

where $N$ is the number of samples of $\hat{s}(t)$, and $\hat{s}(i)$ its $i$-th sample. The power gain of the
beamformer at direction $\theta$ is given by $10\log P(\theta)$. Use the command polar of MATLAB to
generate a plot of the power gain of the antenna as a function of $\theta$. Simulate two cases: a
beamformer with 4 antennas, as above, and a beamformer with 25 antennas.
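The noise-generation recipe of part (c) can be sketched without MATLAB as follows. This is an illustration under stated assumptions, not the project's prescribed code: the helper names `cholesky_lower` and `correlated_noise` are hypothetical, the entries of $e$ are assumed i.i.d. with $E\,ee^* = I$ (real and imaginary parts of variance 1/2 each), and $R_v$ is built from $0.85^{|i-j|}$ rather than from the rounded table. When using MATLAB itself, note that `chol(A)` returns an upper-triangular factor by default, and `chol(A,'lower')` returns the lower-triangular one.

```python
import math
import random

def cholesky_lower(R):
    """Return lower-triangular L with L L^T = R, for real symmetric positive-definite R."""
    n = len(R)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(R[i][i] - s)
            else:
                L[i][j] = (R[i][j] - s) / L[j][j]
    return L

def correlated_noise(L, rng):
    """Draw v = L e, where e has i.i.d. complex Gaussian entries with E|e_k|^2 = 1."""
    n = len(L)
    std = math.sqrt(0.5)  # real and imaginary parts each have variance 1/2
    e = [complex(rng.gauss(0.0, std), rng.gauss(0.0, std)) for _ in range(n)]
    return [sum(L[i][k] * e[k] for k in range(i + 1)) for i in range(n)]

# Spatial covariance with (i, j) entry 0.85^{|i-j|}, as in part (c).
R_v = [[0.85 ** abs(i - j) for j in range(4)] for i in range(4)]
L = cholesky_lower(R_v)
v = correlated_noise(L, random.Random(0))  # one 4 x 1 noise snapshot
```

Since $E\,vv^* = L(E\,ee^*)L^* = LL^*$, the generated vector has covariance $R_v$ whenever $E\,ee^* = I$, which is the property the project asks to be shown.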


Project II.3 (Decision feedback equalization) In this project we study the performance of
decision feedback equalization for the channel

$$C(z) = 0.5 + 1.2z^{-1} + 1.5z^{-2} - z^{-3}$$

The symbols $\{s(i)\}$ that are transmitted through $C(z)$ are i.i.d. and chosen from a QPSK
constellation, i.e., each $s(i)$ is selected randomly from the set

$$\left\{ \pm\frac{\sqrt{2}}{2} \pm j\frac{\sqrt{2}}{2} \right\}$$

The noise sequence $\{v(i)\}$ is assumed i.i.d. and complex-valued; its real and imaginary parts are
uncorrelated Gaussian random variables with variances 0.039 each, so that $\sigma_v^2 = 0.078$ and the SNR
at the input of the equalizer is approximately 18 dB. We start with $L = 13$ and $Q = 2$.

(a) Plot the impulse response, as well as the magnitude of the frequency response, of the channel.
Is the channel minimum phase?

(b) Generate $N = 2000$ QPSK data points $\{s(i)\}$ and transmit them through the channel. Plot
the scatter diagrams of the transmitted sequence $\{s(i)\}$ and the received sequence $\{y(i)\}$.

(c) Compute the optimal filters $\{f_{\rm opt}, b_{\rm opt}\}$ for values of $\Delta$ in the interval $0 \le \Delta \le 15$ and
generate the sequences $\{\hat{s}(i-\Delta)\}$ and $\{\check{s}(i-\Delta)\}$ at the input and output of the decision
device, which is defined by the equation

$$\mathrm{dec}[z] = \frac{\sqrt{2}}{2} \left\{ \mathrm{sign}[\mathrm{Re}(z)] + j\,\mathrm{sign}[\mathrm{Im}(z)] \right\}$$

Plot the number of erroneous decisions as a function of $\Delta$. For $\Delta = 5$, plot the scatter
diagrams of the received sequence $\{y(i)\}$ and of the input to the decision device, $\{\hat{s}(i-5)\}$.


(d) For each $\Delta$, compute the theoretical m.m.s.e. by using m.m.s.e. $= 1/(e_0^* R_\delta^{-1} e_0)$, and plot its
value as a function of $\Delta$. Using the actual data, estimate the m.m.s.e. by computing


Compare the resulting values with the theoretical values. Can you explain why there is a bad
fit between theory and practice for smaller values of $\Delta$? Plot also for $\Delta = 5$, the following
sequences on three separate subplots:


(i) The channel impulse response sequence. 


(ii) The impulse response sequence of the cascade combination of the channel and the
feedforward filter.

(iii) The impulse response sequence of the feedback filter delayed by the value of $\Delta$.


You will observe that the sequence in part (ii) has an almost unit-magnitude sample at time
instant 5, followed by two nonzero samples that correspond to what we call post inter-symbol
interference. This interference should be cancelled by the feedback filter. Any residual ISI
prior to the peak sample at time instant 5 will not be equalized. Compare the coefficients of
the feedback filter in (iii) to the values of the post ISI.


(e) Fix $\Delta = 5$ and let us now examine the effect of changing the length of the feedforward filter.
Generate a plot showing the number of erroneous decisions as a function of $L$, for $L$ varying
between 1 and 15. Keep $Q$ fixed at $Q = 2$. Which value of $L$ results in the smallest number
of errors?


(f) Now fix $\Delta = 5$ and $L = 6$, and let us vary $Q$. Generate a plot showing the number of
erroneous decisions as a function of $Q$, for $Q$ varying between 1 and 6. Which value of $Q$
results in the smallest number of errors?


(g) Now fix $\Delta = 5$, $L = 6$, and $Q = 1$. That is, the feedforward filter has 6 taps and the feedback
filter has a single tap. In all derivations and simulations so far we assumed $\sigma_v^2 = 0.078$. Now
let $\sigma_v^2$ vary between 0.12 and 0.78, say in increments of 0.001. Write a program that generates
a plot showing how the symbol error rate (SER) varies with SNR.


(h) Let us now compare the performance of the DFE with that of a linear equalizer for the same
channel. Recall that we studied linear equalizers in Computer Project II.1. Write a program
that determines the optimal linear equalizer for $L$ varying between 1 and 10. The output of the
equalizer is fed into the decision device. Generate a plot that shows the number of erroneous
decisions as a function of $L$. Use $\sigma_v^2 = 0.078$ and $\Delta = 4$ for the linear equalizer. Fix $L = 4$
for the linear equalizer and plot the scatter diagrams of the received sequence $\{y(i)\}$ and
of the input to the decision device, $\{\hat{s}(i-4)\}$. For this particular channel, do you see any
advantage in using the DFE structure over the linear structure?


(i) Now assume the channel $C(z)$ and the noise variance $\sigma_v^2$ are not known beforehand but that
we know the first 200 transmitted symbols $\{s(i)\}$, in addition to the entire received data record
$\{y(i)\}$. Use the initial 200 data $\{s(i), y(i)\}$ to estimate $C(z)$ and $\sigma_v^2$, as explained in Sec. 6.3.
Note that while the coefficients of the actual channel $C(z)$ are real-valued, the estimated
coefficients will in general be complex-valued. You may use the complex-valued estimates,
or you may keep only their real parts. If the estimates are good enough, their imaginary
parts should be small compared to the real parts. Plot the impulse and frequency responses of
the estimated channel and compare them with those of the actual channel. Repeat the design
of the DFE equalizer by using the estimates of $C(z)$ and $\sigma_v^2$ instead. Use $\sigma_v^2 = 0.078$,
$L = 6$, $Q = 1$, and $\Delta = 5$. Compare the number of errors in this case with the one obtained
in part (g) using the exact channel model and the exact noise variance.


(j) Now repeat part (i) using a linear equalizer of length 4, followed by the nonlinear decision
device. Compare the number of errors you get in this case with that obtained in part (h) and
also with the DFE.


CHAPTER 8


Steepest-Descent Technique


The earlier chapters discussed in some detail the theory of least-mean-squares estima- 
tion and highlighted several applications in the context of channel equalization, channel 
estimation, and antenna beamforming. While the chapters in Part I (Optimal Estimation) 
studied optimal estimators, which are generally nonlinear functions of the observations, 
the chapters in Part II (Linear Estimation) focused on linear estimators with and without 
constraints. In all cases, the estimators were optimal in the sense that they minimized the 
mean-square error. 

Now there are situations where a designer may be interested in performance criteria
other than the mean-square error criterion. Several examples to this effect are provided
in the problems at the end of this part (e.g., Probs. III.12-III.18). In most of these cases it
is generally not possible to describe the optimal solution in closed-form in terms of the 
moments of the underlying variables, and it often becomes necessary to approximate the 
optimal solution iteratively. The iterative procedure would start from an initial guess for 
the solution and then improve upon it from one iteration to another. The purpose of this 
chapter is to describe one class of iterative schemes known as steepest-descent methods, 
which is at the core of most adaptive filtering techniques. 

The steepest-descent methods will be initially motivated by showing how they apply to 
the already-studied case of linear least-mean-squares estimation. By focusing on a situa- 
tion that is familiar to the reader, and one for which the optimal solution is already known, 
we will be able to highlight some of the abilities (and deficiencies) of iterative schemes. In 
particular, we will be able to show, even for the linear estimation problem, that steepest- 
descent methods are of independent value in their own right. For instance, they will help 
us avoid the need to invert $R_y$ in order to determine $K_o$ in the solution of the normal
equations $K_o R_y = R_{xy}$. Such matrix inversions are challenging from a complexity point of
view (requiring of the order of $N^3$ computations for an $N \times N$ matrix $R_y$); they are also
challenging for ill-conditioned matrices $R_y$, namely, for matrices that are close to singular
and that have a large ratio of largest to smallest eigenvalues. Once the main idea of
steepest-descent has been examined in the context of linear estimation, we shall then show 
how to extend the technique to other estimation problems, with more involved performance 
criteria. 

Steepest-descent methods are not studied in this chapter only because they provide a 
mechanism for solving more involved estimation problems. In addition to this useful ob- 
jective, these methods are also important because they will serve as the launching pad for 
the development of adaptive filters in Chapters 10-14. It is because of this latter objective 
that, from now on and until the end of this textbook, we shall adopt a notation that is more 
specific, and also more suited, to the study of adaptive filters. 


Notation. In Parts I (Optimal Estimation) and II (Linear Estimation) of this book, we adopted
the $\{x, y\}$ notation, as is common in estimation theory, for the variable to be estimated and for the
observation vector. The variables $\{x, y\}$ were general and they could refer to scalars or vectors. The
results of the earlier chapters are of broad interest and they are not exclusive to the study of adaptive
filters. However, from now on, we shall develop the theory of adaptive filters in greater detail. In
this context, we will be mostly interested in the case in which $x$ is a scalar and $y$ is a row vector.
Moreover, the $\{x, y\}$ variables will have specific meanings attached to them. For instance, $x$ will
denote the so-called "desired signal" and we shall replace it by the letter $d$, which will be a scalar.
The observation vector $y$, on the other hand, will be a row vector and it will be denoted by $u$. In
this way, we are now interested in estimating $d$ from $u$. Some motivation for our choice of the row
vector notation for $u$ appears in the Notation section in the opening pages of this book.

o 


8.1 LINEAR ESTIMATION PROBLEM 


So let $d$ be a zero-mean scalar-valued random variable with variance $\sigma_d^2$,

$$E\,d = 0, \qquad \sigma_d^2 = E\,|d|^2$$

and let $u$ be a $1 \times M$ zero-mean random row vector with a positive-definite covariance
matrix denoted by $R_u$,

$$R_u \triangleq E\,u^*u \qquad \text{(a square matrix)}$$

The variables $\{d, u\}$ are allowed to be complex-valued for generality, which, as we saw in
several examples in Chapters 1-6, is usually a necessity in digital communications
applications. The $M \times 1$ cross-covariance vector of $\{d, u\}$ is denoted by

$$R_{du} \triangleq E\,du^* \qquad \text{(a column vector)}$$

We then consider the problem of estimating $d$ from $u$ in the linear least-mean-squares
sense as follows:

$$\min_{w} \; E\,|d - uw|^2 \qquad (8.1)$$

where $w$ is $M \times 1$ and is known as the weight vector.
Remark 8.1 (Row vector notation) Observe that since we choose $u$ to be a row vector and the
unknown $w$ to be a column vector, the inner product between $u$ and $w$ is simply written as $uw$, with
no transposition or conjugation symbols needed. We adopt this convention throughout our treatment
of adaptive filters in this and subsequent chapters. All vectors, from this chapter onwards, will be
column vectors with the notable exception of $u$, which will be a row vector.


o 


We can proceed to solve (8.1) either afresh (i.e., from first principles) or by invoking 
the solution of the linear least-mean-squares estimation problem from Chapter 3. We shall 
argue in both ways in order to reinforce the main ideas. 


Using Linear Estimation Theory 


One way to determine the solution of (8.1) is to observe that it is a special case of the linear
least-mean-squares estimation problem studied in Chapter 3, namely,

$$\min_{K} \; E\,(x - Ky)(x - Ky)^* \qquad (8.2)$$

whose solution was seen to be

$$K_o = R_{xy} R_y^{-1}$$

with minimum cost given by

$$\text{m.m.s.e.} = R_x - R_{xy} R_y^{-1} R_{yx}$$

and where $R_{xy} = E\,xy^*$ and $R_y = E\,yy^*$. The statement (8.1) is a slight variation of (8.2)
with the unknown $w$ multiplying the observation $u$ from the right. However, if we replace
$d - uw$ by its conjugate, we can restate (8.1) as

$$\min_{w^*} \; E\,|d^* - w^*u^*|^2 \qquad (8.3)$$

which is now of the form (8.2). In particular, we can make the identifications:

$$d^* \leftarrow x, \qquad u^* \leftarrow y, \qquad w^* \leftarrow K$$

$$\sigma_d^2 \leftarrow R_x, \qquad R_u \leftarrow R_y, \qquad R_{ud} \leftarrow R_{xy}$$

Using the already known solution to (8.2), we find that the solution $(w^o)^*$ of (8.3) is given
by

$$(w^o)^* = R_{xy} R_y^{-1} = R_{ud} R_u^{-1}$$

so that, by transposition,

$$w^o = R_u^{-1} R_{du} \qquad (8.4)$$

and the resulting m.m.s.e. is

$$\text{m.m.s.e.} = \sigma_d^2 - R_{ud} R_u^{-1} R_{du} \qquad (8.5)$$

where $R_{ud} = E\,ud^* = R_{du}^*$. We refer to $w^o$ as the optimal weight vector; the terminology
"weight vector" refers to the fact that $uw^o$ is a weighted combination of the entries of $u$.

Using the Orthogonality Principle 

Alternatively, we can solve (8.1) more directly by invoking the orthogonality principle
of linear least-mean-squares estimation. Specifically, from Thm. 4.1, we know that the
optimal weight vector $w^o$ should lead to an error variable, $d - uw^o$, that is orthogonal to
the observation vector $u$, i.e., it must hold that $d - uw^o \perp u$ or, equivalently,

$$E\,u^*(d - uw^o) = 0 \qquad (8.6)$$

which means that $w^o$ should satisfy the normal equations $R_{du} - R_u w^o = 0$, and we are
back to (8.4). Likewise, the resulting m.m.s.e. can be obtained from the orthogonality
condition as follows:

$$\begin{aligned} \text{m.m.s.e.} &= E\,|d - uw^o|^2 \\ &= E\,(d - uw^o)(d - uw^o)^* \\ &= E\,(d - uw^o)d^* \qquad \text{(because of (8.6))} \\ &= \sigma_d^2 - R_{ud} R_u^{-1} R_{du} \end{aligned}$$

as in (8.5).


Using Completion-of-Squares 


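A third way to solve (8.1) is to proceed afresh, from first principles, by using a
completion-of-squares argument similar to the one we employed in Sec. 3.3. So let

$$J(w) \triangleq E\,|d - uw|^2 = E\,(d - uw)(d - uw)^* \qquad (8.7)$$

denote the cost function in (8.1). Expanding $J(w)$ we get

$$J(w) = \sigma_d^2 - R_{du}^* w - w^* R_{du} + w^* R_u w \qquad (8.8)$$

which can be expressed in vector form as

$$J(w) = \begin{bmatrix} 1 & w^* \end{bmatrix} \begin{bmatrix} \sigma_d^2 & -R_{ud} \\ -R_{du} & R_u \end{bmatrix} \begin{bmatrix} 1 \\ w \end{bmatrix} \qquad (8.9)$$

We can now factor the center matrix into a product of upper-triangular, diagonal, and
lower-triangular block matrices:

$$\begin{bmatrix} \sigma_d^2 & -R_{ud} \\ -R_{du} & R_u \end{bmatrix} = \begin{bmatrix} 1 & -R_{ud} R_u^{-1} \\ 0 & I \end{bmatrix} \begin{bmatrix} \sigma_d^2 - R_{ud} R_u^{-1} R_{du} & 0 \\ 0 & R_u \end{bmatrix} \begin{bmatrix} 1 & 0 \\ -R_u^{-1} R_{du} & I \end{bmatrix}$$

Substituting into (8.9) we obtain

$$J(w) = (\sigma_d^2 - R_{ud} R_u^{-1} R_{du}) + (w - R_u^{-1} R_{du})^* R_u (w - R_u^{-1} R_{du}) \qquad (8.10)$$

from which it is clear, since $R_u > 0$, that $J(w)$ is minimized by choosing $w$ as
$w^o = R_u^{-1} R_{du}$, with the resulting minimum value given by

$$\text{m.m.s.e.} = \sigma_d^2 - R_{ud} R_u^{-1} R_{du} \qquad (8.11)$$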
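As a quick sanity check on (8.10) and (8.11), it may help to specialize to the scalar real-valued case $M = 1$. This special case is an illustration added here, and the symbols $\sigma_u^2 = E\,u^2$ and $r_{du} = E\,du$ are introduced only for this example:

```latex
\begin{align*}
J(w) &= \sigma_d^2 - 2 r_{du} w + \sigma_u^2 w^2 \\
     &= \left( \sigma_d^2 - \frac{r_{du}^2}{\sigma_u^2} \right)
        + \sigma_u^2 \left( w - \frac{r_{du}}{\sigma_u^2} \right)^2
\end{align*}
```

The second term is nonnegative and vanishes only at $w^o = r_{du}/\sigma_u^2$, which leaves the minimum cost $\sigma_d^2 - r_{du}^2/\sigma_u^2$, in agreement with the scalar forms of (8.4) and (8.11).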


In summary, we arrive at the following statement. 


Theorem 8.1 (Optimal linear estimator) All random variables are zero-mean.
Consider a scalar variable $d$ and a row vector $u$ with $R_u = E\,u^*u > 0$. The
linear least-mean-squares estimator of $d$ given $u$ is $\hat{d} = uw^o$ where

$$w^o = R_u^{-1} R_{du}$$

The resulting minimum mean-square error is m.m.s.e. $= \sigma_d^2 - R_{ud} R_u^{-1} R_{du}$.
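A small numerical instance of Theorem 8.1 can be sketched as follows, for real-valued data with $M = 2$. The moments `R_u`, `R_du`, and `sigma_d2` below are made-up values chosen only so that the joint covariance is positive-definite, and the helper names are hypothetical:

```python
def solve_2x2(R, r):
    """Solve R w = r for a 2x2 matrix R by Cramer's rule."""
    det = R[0][0] * R[1][1] - R[0][1] * R[1][0]
    if det == 0:
        raise ValueError("R_u must be nonsingular")
    w0 = (r[0] * R[1][1] - R[0][1] * r[1]) / det
    w1 = (R[0][0] * r[1] - R[1][0] * r[0]) / det
    return [w0, w1]

def cost(w, sigma_d2, R, r):
    """J(w) = sigma_d^2 - 2 r^T w + w^T R w, the real-valued form of (8.8)."""
    quad = sum(w[i] * R[i][j] * w[j] for i in range(2) for j in range(2))
    return sigma_d2 - 2 * sum(r[i] * w[i] for i in range(2)) + quad

R_u = [[2.0, 0.5], [0.5, 1.0]]  # assumed covariance matrix of u (positive-definite)
R_du = [1.0, 0.5]               # assumed cross-covariance vector
sigma_d2 = 1.0                  # assumed variance of d

w_o = solve_2x2(R_u, R_du)                                  # normal equations R_u w = R_du
mmse = sigma_d2 - sum(R_du[i] * w_o[i] for i in range(2))   # sigma_d^2 - R_ud R_u^{-1} R_du
```

Evaluating `cost(w_o, sigma_d2, R_u, R_du)` reproduces `mmse`, and any perturbation of `w_o` yields a larger cost, as (8.10) predicts.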


8.2 STEEPEST-DESCENT METHOD 


The solution $w^o$ of (8.1) is given in closed-form by (8.4). However, as mentioned in the
introduction, such closed-form solutions are generally not possible for more elaborate
performance criteria, other than the mean-square-error criterion (8.1). In such situations, it
becomes necessary to resort to an iterative procedure in order to approximate the solution
$w^o$. In this section we explain how one such iterative procedure can be devised for the
already-solved problem (8.1). The ensuing discussion will be used later to motivate similar
iterative procedures for more general cost functions.

FIGURE 8.1 A typical plot of the quadratic cost function $J(w)$ when $w$ is two-dimensional and
real-valued, say, $w = \mathrm{col}\{\alpha, \beta\}$.

Consider again the cost function $J(w)$ in (8.8), which is a scalar-valued quadratic function
of $w$. We already know that $J(w)$ has a unique global minimum at $w^o = R_u^{-1} R_{du}$
with minimum value given by (8.11). Figure 8.1 shows a typical plot of $J(w)$ for the case
in which $w$ is two-dimensional and real-valued.


Choice of the Search Direction 


Now given $J(w)$, and without assuming any prior knowledge about the location of its
minimizing argument $w^o$, we wish to devise a procedure that starts from an initial guess
for $w^o$ and then improves upon it in a recursive manner until ultimately converging to $w^o$.
The procedure that we seek is one of the form 


(new guess) = (old guess) + (a correction term)


or, more explicitly,

$$w_i = w_{i-1} + \mu p \tag{8.12}$$


where we are writing $w_{i-1}$ to denote a guess for $w^o$ at iteration $(i-1)$, and $w_i$ to denote
the updated guess at iteration $i$. The vector $p$ is an update direction vector that we should
choose adequately, along with the positive scalar $\mu$, in order to guarantee convergence of
$w_i$ to $w^o$. The scalar $\mu$ is called the step-size parameter since it affects how small or how
large the correction term is. In (8.12), and in all future developments in this book, it is
assumed that the index $i$ runs from 0 onwards, so that the initial condition is specified at
$i = -1$. Usually, but not always, the initial condition $w_{-1}$ is taken to be zero.

The criterion for selecting $\mu$ and $p$ is to enforce, if possible, the condition
$J(w_i) < J(w_{i-1})$. In this way, the value of the cost function at the successive iterations will be
monotonically decreasing. To show how this condition can be enforced, we start by relating
$J(w_i)$ to $J(w_{i-1})$. Evaluating $J(w)$ at $w_i = w_{i-1} + \mu p$ and expanding we get

$$\begin{aligned} J(w_i) &= \sigma_d^2 - R_{ud}(w_{i-1} + \mu p) - (w_{i-1} + \mu p)^* R_{du} + (w_{i-1} + \mu p)^* R_u (w_{i-1} + \mu p) \\ &= J(w_{i-1}) + \mu (w_{i-1}^* R_u - R_{du}^*)\, p + \mu p^* (R_u w_{i-1} - R_{du}) + \mu^2 p^* R_u p \end{aligned} \tag{8.13}$$


We can rewrite this equality more compactly by observing from expression (8.8) that the
gradient vector of $J(w)$ with respect to $w$ is equal to

$$\nabla_w J(w) = w^* R_u - R_{du}^* \tag{8.14}$$

This means that the term $(w_{i-1}^* R_u - R_{du}^*)$ that appears in (8.13) is simply the value of the
gradient vector at $w = w_{i-1}$, i.e.,

$$w_{i-1}^* R_u - R_{du}^* = \nabla_w J(w_{i-1})$$

Similarly, the matrix $R_u$ appearing in $\mu^2 p^* R_u p$ is equal to the Hessian matrix of $J(w)$,
i.e.,

$$\mu^2 p^* R_u p = \mu^2 p^* \left[\nabla_w^2 J(w_{i-1})\right] p$$

We can then rewrite (8.13) as

$$J(w_i) = J(w_{i-1}) + 2\mu\, \mathrm{Re}\left[\nabla_w J(w_{i-1})\, p\right] + \mu^2 p^* R_u p \tag{8.15}$$

in terms of the real part of the inner product $\nabla_w J(w_{i-1})\, p$.
Now the last term on the right-hand side of (8.13) is positive for all nonzero $p$ since
$R_u > 0$. Therefore, a necessary condition for

$$J(w_i) < J(w_{i-1}) \tag{8.16}$$

is to require the update direction $p$ to satisfy

$$\mathrm{Re}\left[\nabla_w J(w_{i-1})\, p\right] < 0 \tag{8.17}$$

This condition guarantees that the second term on the right-hand side of (8.15) is strictly
negative. The selection of a vector $p$ according to (8.17) will depend on whether $\nabla_w J(w_{i-1})$
is zero or not. If the gradient vector is zero, then $R_u w_{i-1} = R_{du}$, and thus $w_{i-1}$ already
coincides with the desired solution $w^o$. In this situation, recursion (8.12) would have
attained $w^o$ and $p$ should be selected as $p = 0$.

When, on the other hand, the gradient vector at $w_{i-1}$ is nonzero, there are many choices
of vectors $p$ that satisfy (8.17). For example, any $p$ of the form

$$p = -B\left[\nabla_w J(w_{i-1})\right]^* \tag{8.18}$$

for any Hermitian positive-definite matrix $B$ will do (this choice will also give $p = 0$ when
$\nabla_w J(w_{i-1}) = 0$). To see this, note that for any such $p$, the inner product in (8.17) is
real-valued and evaluates to

$$\nabla_w J(w_{i-1})\, p = -\left[\nabla_w J(w_{i-1})\right] B \left[\nabla_w J(w_{i-1})\right]^*$$

which is negative in view of the positive-definiteness of $B$. The special choice $B = I$ is
very common and it corresponds to the update direction

$$p = -\left[\nabla_w J(w_{i-1})\right]^* = R_{du} - R_u w_{i-1} \tag{8.19}$$


This choice for p reduces (8.12) to the recursion 


$$w_i = w_{i-1} + \mu\,[R_{du} - R_u w_{i-1}], \quad i \ge 0, \quad w_{-1} = \text{initial guess} \tag{8.20}$$


The update direction (8.19) has a useful and intuitive interpretation. Recall that the
gradient vector at any point of a cost function points toward the direction in which the function
is increasing. Now (8.19) is such that, at each iteration, it chooses the update direction $p$ to
point in the opposite direction of the (conjugate) gradient vector. For this reason, we refer
to (8.20) as a steepest-descent method; the successive weight vectors $\{w_i\}$ are obtained by
descending along a path of decreasing cost values. The choice of the step-size $\mu$ is crucial
and, if not chosen with care, it can destroy this desirable behavior. Choosing $p$ according
to (8.19) is only a necessary condition for (8.16) to hold; it is not sufficient as we still need
to choose $\mu$ properly, as we proceed to explain.


Condition on the Step-Size for Convergence 
Introduce the weight-error vector

$$\tilde{w}_i \triangleq w^o - w_i$$

It measures the difference between the weight estimate at time $i$ and the optimal weight
vector, $w^o$, which we are attempting to reach.
Subtracting both sides of the steepest-descent recursion (8.20) from $w^o$ we obtain

$$\tilde{w}_i = \tilde{w}_{i-1} - \mu\,[R_{du} - R_u w_{i-1}]$$

with initial weight-error vector $\tilde{w}_{-1} = w^o - w_{-1}$. Using the fact that $w^o$ satisfies the
normal equations $R_u w^o = R_{du}$, we replace $R_{du}$ in the above recursion by $R_u w^o$ and
arrive at the weight-error recursion:

$$\tilde{w}_i = [I - \mu R_u]\,\tilde{w}_{i-1}, \quad i \ge 0, \quad \tilde{w}_{-1} = \text{initial condition} \tag{8.21}$$


This is a homogeneous difference equation with coefficient matrix $(I - \mu R_u)$. Therefore,
a necessary and sufficient condition for the error vector $\tilde{w}_i$ to tend to zero, regardless of
the initial condition $\tilde{w}_{-1}$, is to require that all of the eigenvalues of the matrix $(I - \mu R_u)$
be strictly less than one in magnitude. That is, $(I - \mu R_u)$ must be a stable matrix. This
conclusion is a special case of a general result. For any homogeneous recursion of the form
$y_i = Ay_{i-1}$, it is well-known that the successive vectors $y_i$ will tend to zero regardless of
the initial condition $y_{-1}$ if, and only if, all eigenvalues of $A$ are strictly inside the unit disc.
The argument that we give below establishes the result for the special case $A = I - \mu R_u$.
For generic matrices $A$, the proof is left as an exercise to the reader; see Prob. III.23.

One way to establish that $(I - \mu R_u)$ must be a stable matrix is the following. Since $R_u$
is a positive-definite Hermitian matrix, its eigen-decomposition has the form (cf. Sec. B.1):

$$R_u = U\Lambda U^* \tag{8.22}$$

where $\Lambda$ is diagonal with positive entries, $\Lambda = \mathrm{diag}\{\lambda_k\}$, and $U$ is unitary, i.e., it satisfies
$UU^* = U^*U = I$. The columns of $U$, say, $\{q_k\}$, are the orthonormal eigenvectors of $R_u$,
namely, each $q_k$ satisfies

$$R_u q_k = \lambda_k q_k, \qquad \|q_k\|^2 = 1$$

Now define the transformed weight-error vector

$$x_i \triangleq U^* \tilde{w}_i \tag{8.23}$$

Since $U$ is unitary and, hence, invertible, $x_i$ and $\tilde{w}_i$ determine each other uniquely. The
vectors $\{x_i, \tilde{w}_i\}$ also have equal Euclidean norms since

$$\|x_i\|^2 = x_i^* x_i = \tilde{w}_i^* U U^* \tilde{w}_i = \tilde{w}_i^* \tilde{w}_i = \|\tilde{w}_i\|^2$$

Therefore, if $x_i$ tends to zero then $\tilde{w}_i$ tends to zero and vice-versa. This means that we can
instead seek a condition on $\mu$ to force $x_i$ to tend to zero. It is more convenient to work with
$x_i$ because it satisfies a difference equation similar to (8.21), albeit one with a diagonal
coefficient matrix. To see this, we multiply (8.21) by $U^*$ from the left, and replace $R_u$ by
$U\Lambda U^*$ and $I$ by $UU^*$, to get

$$x_i = [I - \mu\Lambda]\,x_{i-1}, \qquad x_{-1} = U^*\tilde{w}_{-1} = \text{initial condition} \tag{8.24}$$


The coefficient matrix for this difference equation is now diagonal and equal to $(I - \mu\Lambda)$.
It follows that the evolutions of the individual entries of $x_i$ are decoupled. Specifically, if
we denote these individual entries by

$$x_i = \mathrm{col}\{x_1(i), x_2(i), \ldots, x_M(i)\}$$

then (8.24) shows that the $k$-th entry of $x_i$ satisfies

$$x_k(i) = (1 - \mu\lambda_k)\, x_k(i-1)$$

Iterating this recursion from time $-1$ up to time $i$ gives

$$x_k(i) = (1 - \mu\lambda_k)^{i+1}\, x_k(-1), \quad i \ge 0 \tag{8.25}$$

where $x_k(-1)$ denotes the $k$-th entry of the initial condition $x_{-1}$. We refer to the coefficient
$(1 - \mu\lambda_k)$ as the mode associated with $x_k(i)$. Now in order for $x_k(i)$ to tend to zero
regardless of $x_k(-1)$, the mode $(1 - \mu\lambda_k)$ must have less than unit magnitude. This
condition is both necessary and sufficient. Therefore, in order for all the entries of the
transformed vector $x_i$ to tend to zero, the step-size $\mu$ must satisfy

$$|1 - \mu\lambda_k| < 1, \quad \text{for all } k = 1, 2, \ldots, M \tag{8.26}$$

The modes $\{1 - \mu\lambda_k\}$ are the eigenvalues of the coefficient matrix $(I - \mu R_u)$ in (8.21),
and we have therefore established our initial claim that all eigenvalues of this matrix must
be less than one in magnitude in order for $\tilde{w}_i$ to converge to zero. The condition (8.26) is
of course equivalent to choosing $\mu$ such that

$$0 < \mu < 2/\lambda_{\max}$$

where $\lambda_{\max}$ denotes the largest eigenvalue of $R_u$.


Theorem 8.2 (Steepest-descent algorithm) Consider a zero-mean random variable
$d$ with variance $\sigma_d^2$ and a zero-mean random row vector $u$ with $R_u = \mathsf{E}\,u^*u > 0$.
Let $\lambda_{\max}$ denote the largest eigenvalue of $R_u$. The solution $w^o$
of the linear least-mean-squares estimation problem

$$\min_w \ \mathsf{E}\,|d - uw|^2$$

can be obtained recursively as follows. Start with any initial guess $w_{-1}$,
choose any step-size $\mu$ that satisfies $0 < \mu < 2/\lambda_{\max}$, and iterate for $i \ge 0$:

$$w_i = w_{i-1} + \mu\,[R_{du} - R_u w_{i-1}]$$

Then $w_i \to w^o$ as $i \to \infty$.
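
The recursion of Thm. 8.2 can be sketched in a few lines. The example below assumes real-valued data and NumPy; the moments `R_u` and `R_du`, the function name, and the iteration count are illustrative assumptions rather than values from the text.

```python
import numpy as np

def steepest_descent(R_u, R_du, mu, num_iter, w_init=None):
    """Iterate w_i = w_{i-1} + mu (R_du - R_u w_{i-1}), starting from w_{-1}."""
    R_u = np.asarray(R_u, dtype=float)
    R_du = np.asarray(R_du, dtype=float)
    w = np.zeros_like(R_du) if w_init is None else np.array(w_init, dtype=float)
    for _ in range(num_iter):
        w = w + mu * (R_du - R_u @ w)
    return w

# Made-up moments, used only for illustration.
R_u = np.array([[2.0, 0.5], [0.5, 1.0]])
R_du = np.array([1.0, 0.3])

lam_max = np.linalg.eigvalsh(R_u).max()
w_o = np.linalg.solve(R_u, R_du)       # closed-form solution for comparison

# Step-size inside (0, 2/lam_max): the iterates approach w_o.
w_good = steepest_descent(R_u, R_du, mu=1.0 / lam_max, num_iter=200)

# Step-size above 2/lam_max: the mode 1 - mu*lam_max exceeds one in magnitude.
w_bad = steepest_descent(R_u, R_du, mu=2.1 / lam_max, num_iter=200)
```

With the first step-size the result agrees with `w_o` to numerical precision; with the second the weight error grows without bound, which illustrates why the condition on $\mu$ cannot be dropped.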


8.3 MORE GENERAL COST FUNCTIONS 


With the above statement, we have achieved our original goal of deriving an iterative
procedure for solving the least-mean-squares estimation problem

$$\min_w \ \mathsf{E}\,|d - uw|^2 \tag{8.27}$$

The ideas developed for this case can be applied to more general optimization problems,
say,

$$\min_w \ J(w)$$

with cost functions $J(w)$ that are not necessarily quadratic in $w$ (see, e.g., Probs. III.15
to III.18). The update recursion in these cases would continue to be of the form

$$w_i = w_{i-1} - \mu\left[\nabla_w J(w_{i-1})\right]^* \tag{8.28}$$

in terms of the gradient vector of $J(\cdot)$, and using sufficiently small step-sizes. Clearly,
the expression for the gradient vector will be different for different cost functions. Now,
however, since $J(w)$ may have both local and global minima, the successive iterates $w_i$
from (8.28) need not approach a global minimum of $J(w)$ and, consequently, recursion
(8.28) may end up converging to a local minimum.
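
As a small illustration of this dependence on the starting point, the sketch below applies recursion (8.28) to a scalar, real-valued quartic cost. The cost function $J(w) = w^4 - 3w^2 + w$, the step-size, and the two initial conditions are hypothetical choices made for this example only.

```python
def J(w):
    """Hypothetical non-quadratic cost with one local and one global minimum."""
    return w**4 - 3.0 * w**2 + w

def grad_J(w):
    """Derivative of J with respect to w."""
    return 4.0 * w**3 - 6.0 * w + 1.0

def gradient_descent(w_init, mu=0.01, num_iter=2000):
    """Scalar, real-valued version of recursion (8.28)."""
    w = w_init
    for _ in range(num_iter):
        w = w - mu * grad_J(w)
    return w

w_right = gradient_descent(2.0)    # settles near w = 1.13 (a local minimum)
w_left = gradient_descent(-2.0)    # settles near w = -1.30 (the global minimum)
```

Both runs stop at points where the gradient vanishes, yet only the run started at $-2$ reaches the smaller cost value.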

For this reason, convergence difficulties can arise for general cost functions and it is
usually difficult to predict beforehand whether (8.28) will converge to a local or global
minimum. The ultimate convergence behavior will depend on the value of the step-size
parameter and on the location of the initial condition $w_{-1}$ relative to the local and global
minima of $J(w)$. In the problems, we shall use (8.28) to develop steepest-descent
algorithms for some cost functions with multiple minima. The problems, as well as the
computer project, will illustrate these difficulties. We did not encounter any of these
difficulties with the mean-square-error criterion (8.27) because the cost function in that case
has a unique global minimum.


In order to gain further insight into the workings of steepest-descent methods, we shall
continue to examine recursion (8.20), namely,

$$w_i = w_{i-1} + \mu\,[R_{du} - R_u w_{i-1}], \quad i \ge 0, \quad w_{-1} = \text{initial guess} \tag{9.1}$$

which pertains to the quadratic cost function (8.8). In particular, we shall now study more
closely the manner by which the weight-error vector $\tilde{w}_i$ of (8.21) tends to zero. We repeat
the weight-error vector recursion here for ease of reference,

$$\tilde{w}_i = [I - \mu R_u]\,\tilde{w}_{i-1}, \quad i \ge 0, \quad \tilde{w}_{-1} = \text{initial condition} \tag{9.2}$$

along with its transformed version (8.24):

$$x_i = [I - \mu\Lambda]\,x_{i-1}, \qquad x_{-1} = U^*\tilde{w}_{-1} = \text{initial condition} \tag{9.3}$$


9.1 MODES OF CONVERGENCE 


To begin with, it is clear from (9.3) that the form of the exponential decay of the $k$-th
entry of $x_i$, namely, $x_k(i)$, to zero depends on the value of the mode $1 - \mu\lambda_k$. For instance,
the sign of $1 - \mu\lambda_k$ determines whether the convergence of $x_k(i)$ to zero occurs with or
without oscillation. When $0 \le 1 - \mu\lambda_k < 1$ the decay of $x_k(i)$ to zero is monotonic. On
the other hand, when $-1 < 1 - \mu\lambda_k < 0$ the decay of $x_k(i)$ to zero is oscillatory.


Example 9.1 (Exponential decay) 


Consider a two-dimensional data vector $u$, i.e., $M = 2$ and $R_u$ is $2 \times 2$. Assume the eigenvalues
of $R_u$ are $\lambda_{\min} = 1$ and $\lambda_{\max} = 4$. Then $\mu$ must satisfy $\mu < 2/\lambda_{\max} = 1/2$ for convergence of
the steepest-descent method (9.1) to be guaranteed. If we choose $\mu = 2/5$, then the resulting modes
$\{1 - \mu\lambda_k\}$ will be $1 - \mu\lambda_{\max} = -3/5 < 0$ and $1 - \mu\lambda_{\min} = 3/5 > 0$. In this case, both entries of
the transformed vector $x_i$ will tend to zero; however, one entry will converge monotonically (the one
associated with $\lambda_{\min}$) while the other entry will converge in an oscillatory manner (the one associated
with $\lambda_{\max}$). This situation is illustrated in Fig. 9.1.


It is also clear from (8.25) that the mode $(1 - \mu\lambda_k)$ with the smallest magnitude
determines the entry of $x_i$ that decays to zero at the fastest rate. Likewise, the mode $(1 - \mu\lambda_k)$
with the largest magnitude determines the entry of $x_i$ that decays to zero at the slowest
rate. The above example shows that the fastest and slowest rates of convergence are not
necessarily the ones that are associated with the largest and smallest eigenvalues of $R_u$,
respectively. For the numerical values used in the example, both $\lambda_{\min}$ and $\lambda_{\max}$ lead to
modes $\{1 - \mu\lambda_k\}$ with identical magnitudes (equal to 3/5). Consider the following
alternative example.


[Figure: trajectories of two decaying modes versus iteration.]

FIGURE 9.1 Two exponentially decaying modes from Ex. 9.1.


Example 9.2 (Fastest rate of decay) 


Assume again that $M = 2$ and that $\lambda_{\min} = 1$ and $\lambda_{\max} = 3$. Then $\mu$ must satisfy $\mu < 2/\lambda_{\max} = 2/3$.
Choose $\mu = 7/12$. Then $1 - \mu\lambda_{\max} = -9/12 < 0$ and $1 - \mu\lambda_{\min} = 5/12 > 0$. This shows
that the entry of $x_i$ that is associated with $\lambda_{\min}$ (rather than $\lambda_{\max}$) will decay at the fastest rate.
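
The arithmetic in Examples 9.1 and 9.2 can be reproduced exactly with rational numbers. The following sketch uses Python's standard `fractions` module; the helper names are illustrative.

```python
from fractions import Fraction

def modes(mu, eigenvalues):
    """Return the modes 1 - mu*lambda_k for each eigenvalue."""
    return [1 - mu * lam for lam in eigenvalues]

def decay_type(mode):
    """Classify a stable mode by the sign rule of Sec. 9.1."""
    return "monotonic" if mode >= 0 else "oscillatory"

# Eigenvalues are listed as [lambda_min, lambda_max].
ex91 = modes(Fraction(2, 5), [1, 4])     # Example 9.1
ex92 = modes(Fraction(7, 12), [1, 3])    # Example 9.2

types_91 = [decay_type(m) for m in ex91]
fastest_92 = min(range(2), key=lambda k: abs(ex92[k]))   # index of the fastest mode
```

In Example 9.1 both modes have magnitude 3/5, while in Example 9.2 the smaller magnitude (5/12) belongs to the entry associated with $\lambda_{\min}$.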


9.2 OPTIMAL STEP-SIZE 


In general, for each value of $\mu$, there are $M$ modes, $\{1 - \mu\lambda_k,\ k = 1, 2, \ldots, M\}$. Among
these modes there will be at least one with largest magnitude. This largest-magnitude mode
exhibits the slowest rate of convergence among the entries of $x_i$, and we therefore say that
it is the one that ultimately determines the convergence rate of the algorithm (9.1). Now,
different choices for $\mu$ will lead to different slowest modes. This fact suggests that we could
select $\mu$ optimally by minimizing the magnitude of the slowest mode, i.e., by forcing the
magnitude of the slowest mode to be as far away from one as possible. More specifically,
we could choose $\mu$ optimally by solving the min-max problem:

$$\min_{\mu}\ \max_{k=1,\ldots,M} |1 - \mu\lambda_k| \quad \text{subject to} \quad |1 - \mu\lambda_k| < 1 \tag{9.4}$$


Figure 9.2 plots the curves $|1 - \mu\lambda_k|$ for different eigenvalues $\lambda_k$; only four curves are shown,
corresponding to $\lambda_{\max}$ (the left-most curve), $\lambda_{\min}$ (the right-most curve), and two other
eigenvalues. It is clear from the figure that the choice of $\mu$ for which the largest-magnitude
mode is furthest from one is the point at which the curves $|1 - \mu\lambda_{\max}|$ and $|1 - \mu\lambda_{\min}|$
intersect. If we denote this optimal value by $\mu^o$, then $\mu^o$ should satisfy

$$1 - \mu^o\lambda_{\min} = -(1 - \mu^o\lambda_{\max})$$


which leads to

$$\mu^o = \frac{2}{\lambda_{\max} + \lambda_{\min}} \tag{9.5}$$

The figure further indicates that there are actually two optimal slowest modes, with
identical magnitudes but opposite signs; they are obtained by evaluating $(1 - \mu\lambda_{\min})$ and
$(1 - \mu\lambda_{\max})$ at $\mu = \mu^o$:

$$\text{optimal slowest modes} = \pm\,\frac{\lambda_{\max} - \lambda_{\min}}{\lambda_{\max} + \lambda_{\min}}$$

If we define the eigenvalue spread of the covariance matrix $R_u$ as $\rho = \lambda_{\max}/\lambda_{\min}$, then
we can also write

$$\text{optimal slowest modes} = \pm\,\frac{\rho - 1}{\rho + 1}$$

These values can be interpreted as corresponding to the most favorable slowest
convergence modes that we can expect. Observe that they are dependent on the eigenvalue spread
of $R_u$.


[Figure: curves $|1 - \mu\lambda_k|$ plotted against the step-size $\mu$.]

FIGURE 9.2 The plot shows the curves $\{|1 - \mu\lambda_k|\}$ restricted to the vertical interval $[0, 1)$. The
optimal step-size occurs at the point of intersection of the curves $|1 - \mu\lambda_{\max}|$ and $|1 - \mu\lambda_{\min}|$.


Lemma 9.1 (Optimal step-size) Consider the setting of Thm. 8.2. The optimal
selection for the step-size is

$$\mu^o = \frac{2}{\lambda_{\max} + \lambda_{\min}}$$

where $\{\lambda_{\max}, \lambda_{\min}\}$ are the maximum and minimum eigenvalues of the
covariance matrix $R_u$. This step-size guarantees fastest convergence speed in
the sense that the magnitude of the slowest modes is minimized, and this
magnitude is given by

$$\text{magnitude of the optimal slowest mode(s)} = \frac{\rho - 1}{\rho + 1}$$

where $\rho$ denotes the eigenvalue spread of $R_u$, $\rho = \lambda_{\max}/\lambda_{\min}$. There are two
optimal slowest modes; one negative and one positive.
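
Lemma 9.1 can be cross-checked by a brute-force search over step-sizes. The sketch below assumes NumPy; the eigenvalues reuse the extremes of Example 9.1 ($\lambda_{\min} = 1$, $\lambda_{\max} = 4$) with one made-up interior value, and the grid resolution is an arbitrary choice.

```python
import numpy as np

def optimal_step_size(eigenvalues):
    """mu^o = 2 / (lambda_max + lambda_min), cf. (9.5)."""
    lam = np.asarray(eigenvalues, dtype=float)
    return 2.0 / (lam.max() + lam.min())

def slowest_mode_magnitude(mu, eigenvalues):
    """Largest |1 - mu*lambda_k| over all eigenvalues."""
    lam = np.asarray(eigenvalues, dtype=float)
    return np.abs(1.0 - mu * lam).max()

eigs = [1.0, 2.0, 4.0]          # the interior eigenvalue 2.0 is an assumption
mu_o = optimal_step_size(eigs)
rho = max(eigs) / min(eigs)     # eigenvalue spread

# Brute-force solution of the min-max problem (9.4) over 0 < mu < 2/lambda_max.
grid = np.linspace(0.001, 0.499, 499)
mu_grid = grid[np.argmin([slowest_mode_magnitude(m, eigs) for m in grid])]
```

For these values the formula gives $\mu^o = 2/5$ and a slowest-mode magnitude of $3/5 = (\rho - 1)/(\rho + 1)$, and the grid search lands on the same step-size.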


9.3 WEIGHT-ERROR VECTOR CONVERGENCE 


Let us now examine the convergence behavior of the weight-error vector. Since, as
indicated by (8.23), $\tilde{w}_i = Ux_i$, it follows that $\tilde{w}_i$ is a linear combination of the columns
of $U$, and the coefficients of this linear combination are the entries of $x_i$. Using (8.25) we
then get

$$\tilde{w}_i = \sum_{k=1}^{M} q_k\, x_k(i) = \sum_{k=1}^{M} (1 - \mu\lambda_k)^{i+1}\, x_k(-1)\, q_k \tag{9.6}$$

This expression shows that the convergence of $\tilde{w}_i$ to zero is also determined by the slowest
converging mode among the $\{1 - \mu\lambda_k\}$: once the faster modes have died out relative to the
slowest mode, it is the slowest mode that ultimately determines the convergence rate of $\tilde{w}_i$
to zero. Assume that this slowest mode of convergence corresponds to an eigenvalue $\lambda_{k_o}$.
Then (9.6) shows that in the limit, as $i \to \infty$, $\tilde{w}_i$ tends to zero along the direction of the
associated eigenvector, $q_{k_o}$:

$$\tilde{w}_i \longrightarrow (1 - \mu\lambda_{k_o})^{i+1}\, x_{k_o}(-1)\, q_{k_o} \quad \text{as } i \to \infty$$

Moreover, since the evolution of each entry of $\tilde{w}_i$ is governed by a combination of the
modes $\{1 - \mu\lambda_k\}$, it does not necessarily follow that the choice $\mu = \mu^o$ would result in
$\tilde{w}_i$ going to zero at the fastest rate possible. There are instances in which the weight-error
vector may converge slower initially compared to other choices of the step-size. The
optimal step-size $\mu^o$ guarantees that in the later stages of convergence, when the slowest mode
becomes dominant, the convergence of $\tilde{w}_i$ will be the fastest relative to any other choice of
the step-size. We illustrate this behavior in the next example.


Example 9.3 (Rates of convergence) 


Figure 9.3 shows six runs of the steepest-descent algorithm (9.1) with different choices of the
step-size parameter $\mu$.


[Figure: two panels, each showing trajectories of the squared weight-error norm versus iteration for three step-sizes.]

FIGURE 9.3 Plots of the evolution of $\|\tilde{w}_i\|^2$ for three different choices of the step-size; the
covariance matrices $R_u$ are different for both plots. The optimal choice $\mu^o$ results in fastest
convergence in the top plot right from the initial iterations while, in the bottom plot, it results in
a slower decay initially but catches up as the iterations progress.


For the top figure, the matrix $R_u$ is $5 \times 5$ with largest and smallest eigenvalues given by $\lambda_{\max} = 7.7852$
and $\lambda_{\min} = 0.3674$. The optimum step-size in this case is $\mu^o = 0.2453$. The simulation
uses the step-sizes $\{\mu^o, \mu^o/2, \mu^o/4\}$, so that the resulting modes are

$$\begin{aligned} \mu = \mu^o = 0.2453 &\Longrightarrow \{\boxed{-0.9099},\ 0.5755,\ 0.6129,\ 0.8325,\ \boxed{0.9099}\} \\ \mu = \mu^o/2 = 0.1227 &\Longrightarrow \{\boxed{0.9549},\ 0.9163,\ 0.8065,\ 0.7877,\ 0.0451\} \\ \mu = \mu^o/4 = 0.0613 &\Longrightarrow \{\boxed{0.9775},\ 0.9032,\ 0.8939,\ 0.8345,\ 0.5225\} \end{aligned}$$

Observe that the slowest modes are at $\{\pm 0.9099, 0.9549, 0.9775\}$ with, of course, the smallest
among them in magnitude corresponding to $\mu = \mu^o$. We see from the curves in the top figure
that the choice $\mu^o$ leads to the fastest convergence of $\|\tilde{w}_i\|^2$ to zero.

The bottom plot in Fig. 9.3 illustrates a situation for which convergence is initially slower when
the step-size is chosen as $\mu = \mu^o$. The example corresponds to a $5 \times 5$ matrix $R_u$ with largest
and smallest eigenvalues at $\lambda_{\max} = 7.4724$ and $\lambda_{\min} = 0.0691$. The optimum step-size now is
$\mu^o = 0.2652$. In the simulation we again use $\{\mu^o, \mu^o/2, \mu^o/4\}$ so that the resulting modes are:

$$\begin{aligned} \mu = \mu^o = 0.2652 &\Longrightarrow \{\boxed{-0.9817},\ 0.6864,\ 0.8898,\ 0.9345,\ \boxed{0.9817}\} \\ \mu = \mu^o/2 = 0.1326 &\Longrightarrow \{0.0092,\ 0.8432,\ 0.9449,\ 0.9672,\ \boxed{0.9908}\} \\ \mu = \mu^o/4 = 0.0663 &\Longrightarrow \{0.5046,\ 0.9216,\ 0.9725,\ 0.9836,\ \boxed{0.9954}\} \end{aligned}$$

The slowest modes are at $\{\pm 0.9817, 0.9908, 0.9954\}$ with the smallest among them in magnitude
corresponding to $\mu = \mu^o$. However, we now see from the bottom plot in Fig. 9.3 that the curve that
corresponds to $\mu = \mu^o/2$ decays faster initially than the one that corresponds to $\mu = \mu^o$. Observe
however that, as $i$ increases, the curve that corresponds to $\mu = \mu^o$ ultimately converges faster; this
is because for larger values of $i$, it is the slowest mode that determines the rate of convergence of $\tilde{w}_i$.


9.4 TIME CONSTANTS 


It is customary to describe the rate of convergence of a steepest-descent algorithm in
terms of its time constants, which are defined as follows.

Recall that for an exponential function $f(t) = e^{-t/\tau}$, the time constant is $\tau$ and it
corresponds to the time required for the value of the function to decay by a factor of $e$
since

$$f(t + \tau) = e^{-(t+\tau)/\tau} = f(t)/e$$

Now, for an exponential discrete-time sequence of the form (cf. (8.25)):³

$$|x_k(i)|^2 = (1 - \mu\lambda_k)^2\, |x_k(i-1)|^2, \quad i \ge 0$$


the value of $|x_k(i)|^2$ decays by $(1 - \mu\lambda_k)^2$ at each iteration. Let $T$ denote the time interval
between one iteration and another, and let us fit a decaying exponential function through
the points of the sequence $\{|x_k(i)|^2\}$. Denote the function by $f(t) = e^{-t/\tau_k}$, with a time
constant $\tau_k$ to be determined. Then we must have

$$\begin{aligned} f(t)\big|_{t=(i-1)T} &= |x_k(i-1)|^2 = e^{-(i-1)T/\tau_k} \\ f(t)\big|_{t=iT} &= (1 - \mu\lambda_k)^2\, |x_k(i-1)|^2 = e^{-iT/\tau_k} \end{aligned}$$

Dividing one expression by the other leads to $e^{-T/\tau_k} = (1 - \mu\lambda_k)^2$ or, equivalently,

$$\tau_k = \frac{-T}{2\ln|1 - \mu\lambda_k|} \quad \text{(measured in units of time)}$$

This value measures the time that is needed for the value of $|x_k(i)|^2$ to decay by a factor
of $e$, which corresponds to a decrease of the order of $10\log_{10} e \approx 4.34$ dB. It is common to
normalize the value of $\tau_k$ to be independent of $T$. Thus, let $\bar{\tau}_k = \tau_k/T$. Then

$$\bar{\tau}_k = \frac{-1}{2\ln|1 - \mu\lambda_k|} \quad \text{(measured in iterations)} \tag{9.7}$$

This normalized value measures the approximate number of iterations that is needed for
the value of $|x_k(i)|^2$ to decay by approximately 4.34 dB. For sufficiently small step-sizes
(say, for $\mu\lambda_k \ll 1$), we have $\ln|1 - \mu\lambda_k| \approx -\mu\lambda_k$ and we can approximate the expression
for $\bar{\tau}_k$ by

$$\bar{\tau}_k \approx \frac{1}{2\mu\lambda_k} \quad \text{(iterations)}$$

Usually, the largest $\{\bar{\tau}_k,\ k = 1, 2, \ldots, M\}$ is taken as indicative of the time constant of the
steepest-descent method.
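
To get a feel for how close the small step-size approximation is to (9.7), the following sketch evaluates both expressions. The values $\mu = 0.01$ and $\lambda_k = 1$ are arbitrary illustrative choices, and the exact formula is only meaningful for $0 < |1 - \mu\lambda_k| < 1$.

```python
import math

def time_constant(mu, lam):
    """Normalized time constant (9.7), in iterations."""
    return -1.0 / (2.0 * math.log(abs(1.0 - mu * lam)))

def time_constant_approx(mu, lam):
    """Small step-size approximation 1 / (2 mu lambda)."""
    return 1.0 / (2.0 * mu * lam)

mu, lam = 0.01, 1.0                         # illustrative values, mu*lam << 1
tau_exact = time_constant(mu, lam)          # roughly 49.7 iterations
tau_approx = time_constant_approx(mu, lam)  # 50 iterations

# After tau_exact iterations the squared entry has decayed by a factor of e.
decay = (1.0 - mu * lam) ** (2.0 * tau_exact)
```

The two values differ by about half a percent here; the gap widens as $\mu\lambda_k$ grows.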


³ Note that we are using the squared quantity $|x_k(i)|^2$ instead of $x_k(i)$ in order to obtain a sequence of decaying
nonnegative real numbers.


9.5 LEARNING CURVE


Besides modes and time constants, it is also customary to characterize the convergence
performance of a steepest-descent method in terms of its learning curve. Recall that our
original problem is to determine the vector $w$ that minimizes

$$J(w) = \mathsf{E}\,|d - uw|^2$$

The steepest-descent recursion (9.1) provides successive iterates $w_i$ with cost values

$$J(w_i) = \mathsf{E}\,|d - uw_i|^2$$

Since, by choosing the step-size $\mu$ such that $\mu < 2/\lambda_{\max}$, we are guaranteed a sequence
$\{w_i\}$ that converges to the optimal solution $w^o$, the same condition on $\mu$ also guarantees
that the successive values $J(w_i)$ will converge to the minimum value of $J(w)$, namely
(cf. (8.11)):

$$J(w_i) \longrightarrow J_{\min} = \sigma_d^2 - R_{ud}R_u^{-1}R_{du} \quad \text{as } i \to \infty$$


It turns out, as we now verify, that the decay of $J(w_i)$ to $J_{\min}$ is always monotonic. To see
this, we recall from (8.10) that

$$J(w) = J_{\min} + (w - w^o)^* R_u (w - w^o) \tag{9.8}$$

i.e.,

$$J(w_i) = J_{\min} + \tilde{w}_i^* R_u \tilde{w}_i \tag{9.9}$$

The term $\tilde{w}_i^* R_u \tilde{w}_i$ represents the excess mean-square error at iteration $i$ and it will be
denoted by

$$\xi(w_i) \triangleq J(w_i) - J_{\min} = \tilde{w}_i^* R_u \tilde{w}_i \tag{9.10}$$

It measures how far the cost at iteration $i$ is from the minimum cost, $J_{\min}$.
If we replace $\tilde{w}_i$ by $Ux_i$, and use the eigen-decomposition (8.22), we obtain

$$J(w_i) = J_{\min} + \sum_{k=1}^{M} \lambda_k\, |x_k(i)|^2 = J_{\min} + \sum_{k=1}^{M} \lambda_k (1 - \mu\lambda_k)^{2(i+1)}\, |x_k(-1)|^2$$

which confirms, under the requirement $0 < \mu < 2/\lambda_{\max}$, that $J(w_i) \to J_{\min}$ as $i \to \infty$,
irrespective of the initial weight-error vector $\tilde{w}_{-1}$. Moreover, the convergence is both
exponential and monotonic; it is monotonic since, for any $k$, the eigenvalue $\lambda_k$ is positive and
the factor $(1 - \mu\lambda_k)^2$ is nonnegative and smaller than one, so that no term of the sum can
increase with $i$.

The evolution of $J(w_i)$ as a function of $i$ provides useful information about the learning
behavior of a steepest-descent algorithm. For future reference, we shall adopt the
following definition.


[Figure: example of a learning curve for a steepest-descent algorithm; MSE versus iteration.]

FIGURE 9.4 A typical learning curve $J(i)$ for algorithm (9.1).

Definition 9.1 (Learning curve) The learning curve of a steepest-descent
method associated with a cost function $J(w)$ is denoted by $J(i)$ and defined
as $J(i) = J(w_{i-1})$ for $i \ge 0$. In particular, for the quadratic cost function
$J(w)$ in (8.7), we obtain that its learning curve is given by

$$J(i) = \mathsf{E}\,|e(i)|^2 \quad \text{where} \quad e(i) = d - uw_{i-1}$$

is the so-called a priori output estimation error. In this case, the learning
curve is also called the mean-square-error (MSE) curve.


Observe that the initial value of $J(i)$ is

$$J(0) = J(w_{-1})$$

In other words, it is the value of the cost function evaluated at the initial condition $w_{-1}$.

In general, the value of the learning curve at an iteration $i$ is a measure of the cost
that would result if we freeze the weight estimate at the value obtained at the prior
iteration. Correspondingly, in the mean-square-error case, the learning curve is defined
in terms of the variance of the a priori error $e(i)$ (which uses $w_{i-1}$ and not $w_i$).
Figure 9.4 shows a typical learning curve for the steepest-descent algorithm (9.1) with $M = 3$,
$\lambda_{\min} = 0.3$, $\lambda_{\max} = 1$, and $\mu = 1.5385$. The modes $\{1 - \mu\lambda_k\}$ for this simulation are at
$\{0.5385, 0.0769, -0.5385\}$.
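
A learning curve with these parameters can be generated directly from the modal expansion of $J(w_i)$. In the sketch below the middle eigenvalue 0.6 is inferred from the quoted mode 0.0769, while $J_{\min}$ and the initial transformed weight-error vector are hypothetical values; NumPy is assumed.

```python
import numpy as np

lam = np.array([0.3, 0.6, 1.0])   # 0.6 is inferred from the mode 0.0769
mu = 1.5385
J_min = 0.1                       # hypothetical minimum cost
x_init = np.ones(3)               # hypothetical entries x_k(-1)

modes = 1.0 - mu * lam

# J(i) = J(w_{i-1}) = J_min + sum_k lam_k (1 - mu lam_k)^(2i) |x_k(-1)|^2
i = np.arange(36)
excess = ((modes[None, :] ** (2 * i[:, None])) * lam * x_init**2).sum(axis=1)
J_curve = J_min + excess
```

The array `excess` holds the excess mean-square error at each point of the curve; it decreases at every iteration even though one of the three modes is negative.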


9.6 CONTOUR CURVES OF THE ERROR SURFACE 


Another useful way to examine the performance of a steepest-descent method is by
examining the contours of constant value of its cost function, $J(w)$. These contour curves
are more easily characterized if we perform a change of coordinates. For any $w$, we define
$z = U^*(w - w^o)$ or, equivalently,

$$w - w^o = Uz \tag{9.11}$$

where $U$ is obtained from the eigen-decomposition (8.22) of $R_u$. In other words, we replace
the $w$-coordinate system by a $z$-coordinate system. The origin of the new system, $z = 0$,
occurs at the point $w = w^o$ in the $w$-coordinate system. Likewise, the first basis vector in
the $z$-coordinate system, namely,

$$z_1 = \mathrm{col}\{1, 0, \ldots, 0\}$$

corresponds to the vector

$$w = w^o + q_1$$

in the $w$-coordinate system, where $q_1$ is the first column of $U$. This means that the first
basis vector in the $z$-domain is obtained by shifting $q_1$ to $w^o$ in the $w$-domain. A similar
construction holds for the other basis vectors of the $z$-domain. This change of basis is
illustrated in Fig. 9.5 for the case $M = 2$. We therefore say that the new coordinate system
is obtained by shifting the origin to $w^o$ and then rotating the $w$-axes by $U^*$. The minimum
value of $J(w)$, which occurs at $w = w^o$ in the $w$-domain, will now occur at $z = 0$ in the
$z$-domain.


FIGURE 9.5 Change of coordinates from the $w$-domain to the $z$-domain, defined by
$z = U^*(w - w^o)$, for the case $M = 2$.

Using (9.8), and the eigen-decomposition $R_u = U\Lambda U^*$, we can express the cost
function as

$$J(z) = J_{\min} + z^*\Lambda z = J_{\min} + \sum_{k=1}^{M} \lambda_k\, |z(k)|^2$$

where the $\{z(k)\}$ denote the entries of $z$. The contour curves of $J(z)$ (and,
correspondingly, of $J(w)$), are the curves for which

$$J(z) = \text{a constant} \tag{9.12}$$

for different constant values. Equation (9.12) defines a hyper-ellipse in $M$ dimensions.
The hyper-ellipse is centered at $w^o$ and it has $M$ principal axes. The principal axes are, by
definition, the lines that pass through the origin and are normal to the hyper-ellipse. For
$J(z)$, its principal axes coincide with the basis vectors of the $z$-coordinate system. To
see this, note first that the gradient of $J(z)$ with respect to $z^*$ is equal to $\Lambda z$. Moreover,
any line passing through the origin has the form $\lambda z$ for some scalar $\lambda$. Therefore, for any
such line to be normal to the hyper-ellipse it should satisfy $\Lambda z = \lambda z$. This equality is
possible only if $\lambda$ is an eigenvalue of $\Lambda$ and $z$ the corresponding eigenvector. But since $\Lambda$
is diagonal, this conclusion requires $z$ to be one of the basis vectors. Therefore, the basis
vectors of the $z$-coordinate system are normal to the hyper-ellipse and, consequently, they
are the principal axes of the hyper-ellipse. We therefore find that the eigenvectors of $R_u$,
when shifted to $w^o$, are the principal axes of the elliptic contours of $J(w)$.


[Figure: contours of constant mean-square error and a weight-vector trajectory in the $(\alpha, \beta)$ plane, starting from the initial condition.]

FIGURE 9.6 Elliptic contours of constant mean-square error in two dimensions, where the entries
of $w$ are denoted by $w = \mathrm{col}\{\alpha, \beta\}$ and the entries of $w^o$ are $\{\alpha^o, \beta^o\}$. The figure also indicates a
typical trajectory starting from some initial condition $w_{-1}$.

These conclusions are illustrated in Fig. 9.6 in the 2-dimensional case ($M = 2$). As
the figure indicates, the contour curves are ellipses centered at $w^o$ and with principal axes
along the directions of the eigenvectors of $R_u$. Moreover, since the steepest-descent
algorithm (9.1) updates $w_{i-1}$ along the direction of the negative (conjugate) gradient vector
at $w_{i-1}$, then the steepest-descent algorithm will be such that it moves from one contour
level to another along a direction that is normal to the elliptic contour at $w_{i-1}$. This process
continues until convergence is attained.


9.7 ITERATION-DEPENDENT STEP-SIZES 


The steepest-descent algorithm of Thm. 8.2 uses a constant step-size $\mu$. In many
instances, it may be desirable to vary the value of the step-size in order to obtain better
control over the speed of convergence of the algorithm.


Condition for Convergence 


Starting with (8.12), the arguments of Sec. 8.2 would still hold if we replace $\mu$ by an
iteration-dependent positive step-size $\mu(i)$. In this case, recursion (9.1) would be replaced
by

$$w_i = w_{i-1} + \mu(i)\,[R_{du} - R_u w_{i-1}], \quad w_{-1} = \text{initial guess} \tag{9.13}$$


Of course, not every choice of the step-size sequence $\{\mu(i)\}$ will guarantee convergence of
$w_i$ to $w^o$. For example, one might be tempted to extrapolate the arguments of Sec. 8.2 and
conclude that by choosing $\mu(i)$ such that $\mu(i) < 2/\lambda_{\max}$ for all $i$, the weight-error vector
$\tilde{w}_i$ will converge to zero. This conclusion is generally false. Consider, for illustration
purposes, a scalar recursion of the form $x(i) = a(i)\,x(i-1)$ for $i \ge 0$. Then

$$x(i) = \left[\prod_{j=0}^{i} a(j)\right] x(-1)$$

If the $\{a(j)\}$ are such that $|a(j)| < 1$ for all finite $j$, it does not necessarily follow that

$$\prod_{j=0}^{i} a(j) \longrightarrow 0 \quad \text{as } i \to \infty$$

That is, the product of infinitely many numbers that are all less than one in magnitude is
not necessarily zero (see Prob. II.2). The product would tend to zero if all the $\{a(j)\}$ have
their magnitudes uniformly bounded away from one, say, $|a(j)| < \alpha < 1$ for all $j$ and for
some $\alpha > 0$.
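
A concrete instance may help here. The sketch below uses the hypothetical sequence $a(j) = 1 - 1/(j+2)^2$, every term of which is less than one in magnitude; its partial products equal $(n+3)/(2(n+2))$ and therefore approach $1/2$ rather than zero. A constant sequence $a(j) = 0.9$, which is uniformly bounded away from one, is included for contrast.

```python
import math

def partial_product(n):
    """Product of a(j) = 1 - 1/(j+2)^2 for j = 0, ..., n."""
    return math.prod(1.0 - 1.0 / (j + 2) ** 2 for j in range(n + 1))

def partial_product_bounded(n, alpha=0.9):
    """Product of the constant sequence a(j) = alpha < 1 for j = 0, ..., n."""
    return alpha ** (n + 1)

p_small = partial_product(10)          # equals 13/24
p_large = partial_product(100_000)     # close to, and above, 1/2
p_bounded = partial_product_bounded(1_000)
```

The first product stalls near $1/2$ no matter how many factors are included, whereas the second decays to zero geometrically.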

The following statement provides one necessary condition on $\mu(i)$ in (9.13) for
convergence; the proof is given in Probs. III.9 and III.10. As explained after the statement of the
theorem, other conditions are possible.


Theorem 9.1 (Convergence condition) Given a zero-mean random variable $d$
with variance $\sigma_d^2$ and a zero-mean random row vector $u$ with $R_u = \mathsf{E}\,u^*u > 0$,
the solution of the linear least-mean-squares estimation problem

$$\min_w \ \mathsf{E}\,|d - uw|^2$$

can be obtained recursively as follows. Start with any initial guess $w_{-1}$,
choose a bounded step-size sequence $\mu(i)$ that tends to zero, i.e., $\mu(i) \to 0$,
and iterate:

$$w_i = w_{i-1} + \mu(i)\,[R_{du} - R_u w_{i-1}], \quad i \ge 0$$

Then $w_i \to w^o$ as $i \to \infty$ if, and only if, the step-size sequence satisfies
$\sum_{i=0}^{\infty} \mu(i) = \infty$. That is, if and only if, $\{\mu(i)\}$ is a divergent sequence.


The result of Thm. 9.1 requires the sequence u(i) to tend to zero but not too fast since 
the sequence has to diverge as well. A typical sequence that satisfies the conditions of the 
theorem is 


a 
j) - —— i> 
uli) TIP’ a>0, @>0, i20 


Other examples are any bounded step-size sequences that satisfy both conditions

$$\sum_{i=0}^{\infty} \mu^2(i) < \infty \quad \text{and} \quad \sum_{i=0}^{\infty} \mu(i) = \infty$$

This is because the finite-energy condition on the sequence $\{\mu(i)\}$ guarantees $\mu(i) \to 0$.
Still, convergence can occur even if the conditions of the theorem are violated. For example,
in Prob. II.4 it is shown that with $\mu(i) \to \alpha > 0$ as $i \to \infty$, i.e., even with $\mu(i)$
tending to a nonzero limit, but as long as $\alpha < 2/\lambda_{\max}$, then $w_i$ is guaranteed to converge
to $w^o$.
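
The role of the divergence requirement can be illustrated with a scalar sketch. It assumes $R_u = 1$ and two hypothetical step-size sequences that both tend to zero: $\mu(i) = 0.5/(i+1)$, whose sum diverges, and $\mu(i) = 0.5/(i+1)^2$, whose sum is finite. In the scalar case the weight error obeys $\tilde{w}_i = (1 - \mu(i)R_u)\tilde{w}_{i-1}$, so it is enough to accumulate the product of these factors.

```python
def final_error(step_size, n_iter=10000, r_u=1.0, initial_error=1.0):
    """Iterate the scalar weight-error recursion err_i = (1 - mu(i) * r_u) * err_{i-1}."""
    err = initial_error
    for i in range(n_iter):
        err *= 1.0 - step_size(i) * r_u
    return err


# Both sequences tend to zero; only the first one has a divergent sum.
err_divergent = final_error(lambda i: 0.5 / (i + 1))
err_summable = final_error(lambda i: 0.5 / (i + 1) ** 2)

if __name__ == "__main__":
    print("sum mu(i) = infinity:", err_divergent)   # keeps shrinking toward zero
    print("sum mu(i) < infinity:", err_summable)    # stalls at a nonzero value
```

With the summable sequence the step-sizes die out before the error has been removed, which is the situation that the condition $\sum_i \mu(i) = \infty$ rules out.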


Optimal Iteration-Dependent Step-Size 

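We saw earlier in Sec. 9.1 that the choice $\mu^o = 2/(\lambda_{\max} + \lambda_{\min})$ results in fastest
convergence to steady-state in the constant step-size case. We can also seek an optimal choice for
fastest convergence in the iteration-dependent step-size case. To do so, we start from the
weight-error recursion

$$\tilde{w}_i = [I - \mu(i) R_u]\,\tilde{w}_{i-1}$$ (9.14)

and use expression (9.10) for the excess mean-square error to write

$$\begin{aligned}
\xi(w_i) &= \tilde{w}_i^* R_u \tilde{w}_i \\
&= \tilde{w}_{i-1}^* [I - \mu(i) R_u]\, R_u\, [I - \mu(i) R_u]\, \tilde{w}_{i-1} \\
&= \tilde{w}_{i-1}^* [R_u - 2\mu(i) R_u^2 + \mu^2(i) R_u^3]\, \tilde{w}_{i-1} \\
&= \xi(w_{i-1}) - 2\mu(i)\, \tilde{w}_{i-1}^* R_u^2 \tilde{w}_{i-1} + \mu^2(i)\, \tilde{w}_{i-1}^* R_u^3 \tilde{w}_{i-1} \\
&= \xi(w_{i-1}) - \left[ 2\mu(i)\, \tilde{w}_{i-1}^* R_u^2 \tilde{w}_{i-1} - \mu^2(i)\, \tilde{w}_{i-1}^* R_u^3 \tilde{w}_{i-1} \right]
\end{aligned}$$ (9.15)

By maximizing the term

$$2\mu(i)\, \tilde{w}_{i-1}^* R_u^2 \tilde{w}_{i-1} - \mu^2(i)\, \tilde{w}_{i-1}^* R_u^3 \tilde{w}_{i-1}$$

with respect to $\mu(i)$ we can guarantee largest decay in the excess mean-square error at
iteration $i$. Since this term is quadratic in $\mu(i)$, setting its derivative to zero leads to the optimal choice

$$\mu^o(i) = \frac{\tilde{w}_{i-1}^* R_u^2 \tilde{w}_{i-1}}{\tilde{w}_{i-1}^* R_u^3 \tilde{w}_{i-1}}$$ (9.16)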


This argument is possible here since the step-size is allowed to vary from one iteration
to another and, therefore, we can minimize $\xi(w_i)$ with respect to $\mu(i)$ at each $i$. The
same argument is not possible in the constant step-size case. There we selected $\mu^o$ by
solving the min-max problem (9.4) instead. We still need to argue that the choice (9.16)
leads to a convergent algorithm; in the process we shall also verify that $\{\mu^o(i)\}$ leads to a
monotonically decreasing excess mean-square error.

First note that the choice (9.16) does not satisfy the convergence conditions of Thm. 9.1.
For instance, assume $\tilde{w}_{i-1}$ and $R_u$ are scalars. Then, expression (9.16) collapses to
$\mu^o(i) = 1/R_u$, which does not tend to zero as required by the statement of Thm. 9.1
(see also Prob. III.5). However, as mentioned in the comments following the statement
of the theorem, the conditions on $\mu(i)$ in the theorem are only sufficient. There can exist
step-size sequences that do not satisfy the conditions of the theorem but nevertheless result
in convergence. The sequence $\{\mu^o(i)\}$ is one such example.

To see this, we substitute (9.16) into expression (9.15) for $\xi(w_i)$ and get

$$\xi^o(w_i) = \xi^o(w_{i-1}) - \frac{\left(\tilde{w}_{i-1}^* R_u^2 \tilde{w}_{i-1}\right)^2}{\tilde{w}_{i-1}^* R_u^3 \tilde{w}_{i-1}}$$

FIGURE 9.7 Typical plots of learning curves corresponding to the cases of an optimal constant
step-size and an optimal iteration-dependent step-size.


If $\tilde{w}_{i-1} = 0$, then $w_{i-1} = w^o$ and algorithm (9.13) would have converged since $w_i = w_{i-1}$
for all subsequent $i$. So assume $\tilde{w}_{i-1} \neq 0$. Then we necessarily have, since $R_u > 0$,

$$\xi^o(w_i) < \xi^o(w_{i-1})$$

That is, the sequence $\{\xi^o(w_i)\}$ is monotonically decreasing. Combining this result with
the fact that $\{\xi^o(w_i)\}$ is bounded from below (since $\xi^o(w_i) \geq 0$), we conclude that the
sequence $\{\xi^o(w_i)\}$ is convergent. The question that remains is whether its limit is zero.

We showed earlier in Sec. 8.2 that there exists a constant step-size that guarantees convergence
of $\tilde{w}_i$ to zero (namely, any $\mu$ satisfying $\mu < 2/\lambda_{\max}$). Choose any such step-size
and let $\xi^{\mu}(w_i^{\mu})$ denote the excess mean-square error that results from the recursion

$$w_i^{\mu} = w_{i-1}^{\mu} + \mu\,[R_{du} - R_u w_{i-1}^{\mu}]$$

We are writing $w_i^{\mu}$ to distinguish it from the weight estimate obtained from the recursion
$w_i = w_{i-1} + \mu^o(i)\,[R_{du} - R_u w_{i-1}]$ with the optimal iteration-dependent step-size. Then,
starting from the same initial conditions $w_{-1} = w_{-1}^{\mu}$, it holds that

$$\xi^o(w_i) \leq \xi^{\mu}(w_i^{\mu})$$

This is because the optimal iteration-dependent step-size minimizes the excess mean-square
error at each iteration. But since $\xi^{\mu}(w_i^{\mu}) \to 0$, we conclude that $\xi^o(w_i) \to 0$,
as desired.

Figure 9.7 plots the learning curve $J(i) = J(w_{i-1})$ for both cases of an optimal constant
step-size and an optimal iteration-dependent step-size. The result confirms the faster
decay of the learning curve in the latter case.
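
The comparison between the two step-size choices can be reproduced with a small sketch. It assumes a diagonal $R_u$ with hypothetical eigenvalues $\{1, 0.1\}$ and an arbitrary initial weight-error vector, so that each coordinate of $\tilde{w}_i$ evolves independently; the helper names are illustrative only. The iteration-dependent step-size is computed as the ratio $\tilde{w}_{i-1}^* R_u^2 \tilde{w}_{i-1} / \tilde{w}_{i-1}^* R_u^3 \tilde{w}_{i-1}$ obtained by maximizing the decay term in (9.15).

```python
def excess_mse(lams, err):
    """Excess mean-square error err^* R_u err for a diagonal R_u with eigenvalues lams."""
    return sum(l * e * e for l, e in zip(lams, err))


def constant_curve(lams, err, mu, n_iter):
    """Excess-MSE curve of the weight-error recursion with a constant step-size mu."""
    curve = [excess_mse(lams, err)]
    for _ in range(n_iter):
        err = [(1.0 - mu * l) * e for l, e in zip(lams, err)]
        curve.append(excess_mse(lams, err))
    return curve


def optimal_curve(lams, err, n_iter):
    """Excess-MSE curve when mu(i) maximizes the decay at every iteration."""
    curve = [excess_mse(lams, err)]
    for _ in range(n_iter):
        num = sum(l ** 2 * e * e for l, e in zip(lams, err))   # err^* R_u^2 err
        den = sum(l ** 3 * e * e for l, e in zip(lams, err))   # err^* R_u^3 err
        mu = num / den if den > 0.0 else 0.0                   # nothing left to do once err = 0
        err = [(1.0 - mu * l) * e for l, e in zip(lams, err)]
        curve.append(excess_mse(lams, err))
    return curve


lams = [1.0, 0.1]    # assumed eigenvalues of R_u
start = [1.0, 1.0]   # assumed initial weight-error vector
const = constant_curve(lams, start, 2.0 / (max(lams) + min(lams)), 50)
opt = optimal_curve(lams, start, 50)

if __name__ == "__main__":
    for i in (0, 1, 5, 20, 50):
        print(i, const[i], opt[i])
```

The curve for the iteration-dependent step-size never increases, and its first step is at least as good as that of any constant step-size, in line with the monotonicity argument above.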


9.8 NEWTON'S METHOD 


We mentioned in our derivation of the steepest-descent algorithm in Sec. 8.2 that any
choice for the search direction of the form (cf. (8.18)):

$$p = -B\,[\nabla_w J(w_{i-1})]^*$$

for any positive-definite matrix $B$, can be used to enforce the condition

$$\mathrm{Re}\,[\nabla_w J(w_{i-1})\,p] < 0$$

We chose $B = I$ in our earlier discussions, which led to the steepest-descent variants
of Thms. 8.2 and 9.1. But other choices for $B$ are possible and they lead to different
algorithms with different properties. One useful choice for $B$ is

$$B = [\nabla_w^2 J(w_{i-1})]^{-1}$$

in terms of the inverse of the Hessian matrix, in which case the search direction becomes

$$p = -[\nabla_w^2 J(w_{i-1})]^{-1}\,[\nabla_w J(w_{i-1})]^*$$ (9.17)

The resulting steepest-descent recursion (8.12) would be

$$w_i = w_{i-1} - \mu\,[\nabla_w^2 J(w_{i-1})]^{-1}\,[\nabla_w J(w_{i-1})]^*, \quad i \geq 0, \quad w_{-1} = \text{initial guess}$$ (9.18)

This recursive form is known as Newton's method.
For the quadratic cost function $J(w)$ of (8.8), we use (8.14) to find that (9.18) reduces
to

$$w_i = w_{i-1} + \mu R_u^{-1}\,[R_{du} - R_u w_{i-1}]$$ (9.19)


We can examine the properties of this algorithm in much the same way as we did for re- 
cursion (9.1). So we shall be brief. 


Convergence Properties 
Subtracting both sides of (9.19) from $w^o$, and using the fact that $w^o$ satisfies the normal
equations $R_u w^o = R_{du}$, we arrive at the weight-error recursion

$$\tilde{w}_i = (1 - \mu)\,\tilde{w}_{i-1}$$ (9.20)

In contrast to (9.2) and (9.14), we find that the covariance matrix $R_u$ does not appear any
longer in (9.20). In particular, convergence is now guaranteed for all step-sizes $\mu$ that
satisfy $0 < \mu < 2$, a condition that is independent of $R_u$.

Actually, the choice $\mu = 1$ in (9.20) leads to immediate convergence because $\tilde{w}_i = 0$
with no further iteration. This is a well-known property of Newton's method; the method
guarantees convergence in a single iteration to the minimizing argument of a quadratic
cost function by choosing $\mu = 1$. This fact can also be seen from recursion (9.19), which
for $\mu = 1$ collapses to

$$w_i = w_{i-1} + R_u^{-1} R_{du} - w_{i-1} = w_{i-1} + w^o - w_{i-1} = w^o$$

Of course, applying Newton's method (9.19) to the solution of the least-mean-squares
estimation problem (8.1) has the same complexity as using (8.4) since both schemes require
the inversion of $R_u$. The usefulness of Newton's method relative to the normal equations
will become evident later in Chapters 10 and 11 when we use it to devise some useful
adaptive variants.
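
The single-step convergence for $\mu = 1$ is easy to confirm numerically. The sketch below assumes a hypothetical real-valued $2 \times 2$ covariance matrix and cross-covariance vector (the numbers are made up for illustration) and applies one iteration of (9.19).

```python
def solve_2x2(R, b):
    """Solve R x = b for a 2x2 matrix R by Cramer's rule."""
    (a11, a12), (a21, a22) = R
    det = a11 * a22 - a12 * a21
    return [(a22 * b[0] - a12 * b[1]) / det, (a11 * b[1] - a21 * b[0]) / det]


def newton_step(w, R, r_du, mu):
    """One iteration of (9.19): w + mu * R^{-1} (r_du - R w), real-valued 2x2 case."""
    residual = [r_du[k] - sum(R[k][m] * w[m] for m in range(2)) for k in range(2)]
    direction = solve_2x2(R, residual)
    return [w[k] + mu * direction[k] for k in range(2)]


R_u = [[2.0, 0.5], [0.5, 1.0]]   # assumed positive-definite covariance matrix
R_du = [1.0, -1.0]               # assumed cross-covariance vector
w_opt = solve_2x2(R_u, R_du)     # solution of the normal equations
w_start = [5.0, -3.0]            # arbitrary initial guess

w_one = newton_step(w_start, R_u, R_du, 1.0)    # lands on w_opt in one step
w_half = newton_step(w_start, R_u, R_du, 0.5)   # moves halfway toward w_opt

if __name__ == "__main__":
    print("w_opt :", w_opt)
    print("mu=1  :", w_one)
    print("mu=0.5:", w_half)
```

With $\mu = 0.5$ the weight error is exactly halved, as predicted by (9.20), regardless of the eigenvalue spread of the assumed $R_u$.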


Learning Curve 

Recursion (9.19) estimates the vector $w$ that minimizes $J(w) = E\,|\boldsymbol{d} - \boldsymbol{u}w|^2$. It does so
by evaluating successive iterates $w_i$ with cost values $J(w_i) = E\,|\boldsymbol{d} - \boldsymbol{u}w_i|^2$. Since, by
choosing $0 < \mu < 2$, we are guaranteed a sequence $\{w_i\}$ that converges to $w^o$, this same
condition on $\mu$ guarantees convergence of $J(w_i)$ to the minimum value of $J(w)$, namely
(cf. (8.11)),

$$J(w_i) \to J_{\min} = J(w^o) = \sigma_d^2 - R_{ud} R_u^{-1} R_{du}, \quad \text{as } i \to \infty$$

The decay of $J(w_i)$ to $J_{\min}$ is again monotonic. This can be seen as follows. Using the
representation (9.9),

$$J(w_i) = J_{\min} + \tilde{w}_i^* R_u \tilde{w}_i$$

replacing $\tilde{w}_i$ by $U x_i$, where $\tilde{w}_i$ now evolves according to (9.20), and using the
eigen-decomposition (8.22) for $R_u$, we obtain

$$J(w_i) = J_{\min} + \sum_{k=1}^{M} \lambda_k |x_k(i)|^2 = J_{\min} + (1 - \mu)^{2(i+1)} \sum_{k=1}^{M} \lambda_k |x_k(-1)|^2$$

This expression confirms that, under the requirement $0 < \mu < 2$,

$$\lim_{i \to \infty} J(w_i) = J_{\min}$$

irrespective of the initial weight-error vector $\tilde{w}_{-1}$. Moreover, the convergence is both
exponential and monotonic and, in contrast to the steepest-descent analysis of Sec. 9.5,
convergence is now governed by a single mode at $(1 - \mu)^2$. Therefore, with Newton's
method, we need only associate a single time constant that is equal to (cf. (9.7)):

$$\tau = \frac{-1}{2 \ln |1 - \mu|} \quad \text{(iterations)}$$

The value of $\tau$ is an approximation for the number of iterations that is needed for $\|\tilde{w}_i\|^2$ to
decay by approximately 4.4 dB.
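
As a quick numerical check of the time-constant expression, the sketch below evaluates $\tau$ for an assumed step-size $\mu = 0.1$ and verifies that the mode $(1-\mu)^2$ raised to the power $\tau$ equals $e^{-1}$, which corresponds to about 4.3 dB, close to the rounded figure quoted above. The function name is illustrative only.

```python
import math


def newton_time_constant(mu):
    """Time constant tau = -1 / (2 ln|1 - mu|) of the single mode (1 - mu)^2."""
    if not 0.0 < mu < 2.0 or mu == 1.0:
        raise ValueError("tau is defined for 0 < mu < 2 with mu != 1")
    return -1.0 / (2.0 * math.log(abs(1.0 - mu)))


mu = 0.1                                   # assumed step-size
tau = newton_time_constant(mu)
decay = abs(1.0 - mu) ** (2.0 * tau)       # squared weight-error decay after tau iterations
decay_db = -10.0 * math.log10(decay)

if __name__ == "__main__":
    print("tau (iterations):", tau)
    print("decay factor    :", decay)
    print("decay in dB     :", decay_db)
```

The case $\mu = 1$ is excluded because the error vanishes after a single iteration and no time constant is needed.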

With regard to the contour curves of the error surface, they are still the same hyper-elliptic
curves that were described in Sec. 9.6 (after all we are dealing with the same
quadratic cost function $J(w)$ from (8.8)). As shown in that section, the principal axes
of the contour curves are the eigenvectors of the covariance matrix $R_u$ shifted to the
location of $w^o$. Now, however, the search direction in Newton's method is not along the
normal direction to the elliptic curves anymore, but along the line connecting $w_{i-1}$ to $w^o$.
To see this, recall that when $\mu = 1$, convergence of Newton's method occurs in a single
step, which is only possible if the search direction is along the line connecting $w_{i-1}$ to $w^o$.
When $\mu \neq 1$, we are still moving along the same direction connecting $w_{i-1}$ to $w^o$ but for
a different distance (shorter when $\mu < 1$) since from (9.19),

$$w_i = w_{i-1} + \mu\,(w^o - w_{i-1})$$

Remark 9.1 (Regularization) When the Hessian matrix in (9.18) is close to singular, it is
common to employ regularization, in which case Newton's method is sometimes known as the
Levenberg-Marquardt method and it becomes

$$w_i = w_{i-1} - \mu\,[\epsilon I + \nabla_w^2 J(w_{i-1})]^{-1}\,[\nabla_w J(w_{i-1})]^*, \quad i \geq 0, \quad w_{-1} = \text{initial guess}$$

The difference relative to Newton's recursion (9.18) is the addition of the small positive parameter
$\epsilon$. This algorithm can still be interpreted as a steepest-descent method of the form (8.12) with $B$ in
(8.18) chosen as

$$B = [\epsilon I + \nabla_w^2 J(w_{i-1})]^{-1}$$

More generally, we can employ iteration-dependent step-sizes, $\mu(i)$, and iteration-dependent
regularization parameters, $\epsilon(i) > 0$, and write instead

$$w_i = w_{i-1} - \mu(i)\,[\epsilon(i) I + \nabla_w^2 J(w_{i-1})]^{-1}\,[\nabla_w J(w_{i-1})]^*, \quad i \geq 0, \quad w_{-1} = \text{initial guess}$$


LMS Algorithm


In Chapters 10-14 we start to develop the theory of adaptive algorithms by studying
stochastic-gradient methods. These methods are obtained from steepest-descent imple- 
mentations by replacing the required gradient vectors and Hessian matrices by some suit- 
able approximations. Different approximations lead to different algorithms with varied 
degrees of complexity and performance properties. The resulting methods will be generi- 
cally called stochastic-gradient algorithms since, by employing estimates for the gradient 
vector, the update directions become subject to random fluctuations that are often referred 
to as gradient noise. 

Stochastic-gradient algorithms serve at least two purposes. First, they avoid the need to 
know the exact signal statistics (e.g., covariances and cross-covariances), which are neces- 
sary for a successful steepest-descent implementation but are nevertheless rarely available 
in practice. Stochastic-gradient methods achieve this feature by means of a learning mech- 
anism that enables them to estimate the required signal statistics. Second, these methods 
possess a tracking mechanism that enables them to track variations in the signal statistics. 
The two combined capabilities of learning and tracking are the main reasons behind the 
widespread use of stochastic-gradient methods (and the corresponding adaptive filters). It 
is because of these abilities that we tend to describe adaptive filters as "smart systems";
smart in the sense that they can learn the statistics of the underlying signals and adjust
their behavior to variations in the "environment" in order to keep the performance level in
check.

In the body of this chapter and the subsequent chapters, we describe a handful of 
stochastic-gradient algorithms, including the least-mean-squares (LMS) algorithm, the nor- 
malized least-mean-squares (NLMS) algorithm, the affine projection algorithm (APA), and 
the recursive least-squares algorithm (RLS). In Probs. III.26-III.38 we devise several other
stochastic-gradient methods. By the end of the presentation, the reader will have had
sufficient exposure to the procedure that is commonly used to motivate and derive adaptive
filters.


10.1 MOTIVATION 


We start our discussions by reviewing the linear estimation problem of Sec. 8.1 and the 
corresponding steepest-descent methods of Chapter 8. Thus, let $\boldsymbol{d}$ be a zero-mean
scalar-valued random variable with variance $\sigma_d^2 = E\,|\boldsymbol{d}|^2$. Let further $\boldsymbol{u}^*$ be a zero-mean $M \times 1$
random variable with a positive-definite covariance matrix, $R_u = E\,\boldsymbol{u}^*\boldsymbol{u} > 0$. The $M \times 1$
cross-covariance vector of $\boldsymbol{d}$ and $\boldsymbol{u}$ is denoted by $R_{du} = E\,\boldsymbol{d}\boldsymbol{u}^*$. We know from Sec. 8.1
that the weight vector $w^o$ that solves

$$\min_w E\,|\boldsymbol{d} - \boldsymbol{u}w|^2$$ (10.1)

is given by

$$w^o = R_u^{-1} R_{du}$$ (10.2)

and that the resulting minimum mean-square error is

$$\text{m.m.s.e.} = \sigma_d^2 - R_{ud} R_u^{-1} R_{du}$$ (10.3)


In Chapter 8 we developed several steepest-descent methods that approximate $w^o$ iteratively,
until eventually converging to it. For example, in Sec. 8.1 we studied the following
recursion with a constant step-size,

$$w_i = w_{i-1} + \mu\,[R_{du} - R_u w_{i-1}], \quad w_{-1} = \text{initial guess}$$ (10.4)

where the update direction, $R_{du} - R_u w_{i-1}$, was seen to be equal to the negative
conjugate-transpose of the gradient vector of the cost function at $w_{i-1}$, i.e.,

$$R_{du} - R_u w_{i-1} = -[\nabla_w J(w_{i-1})]^*$$ (10.5)

where

$$J(w) \triangleq E\,|\boldsymbol{d} - \boldsymbol{u}w|^2$$

In Sec. 9.7 we allowed for an iteration-dependent step-size, $\mu(i)$, and studied the recursion

$$w_i = w_{i-1} + \mu(i)\,[R_{du} - R_u w_{i-1}], \quad w_{-1} = \text{initial guess}$$ (10.6)

and in Sec. 9.8 we studied Newton's recursion,

$$w_i = w_{i-1} + \mu R_u^{-1}\,[R_{du} - R_u w_{i-1}], \quad w_{-1} = \text{initial guess}$$ (10.7)

where $R_u^{-1}$ resulted from using the inverse of the Hessian matrix of $J(w)$, namely,

$$R_u = \nabla_w^2 J(w_{i-1}) = \nabla_{w^*}[\nabla_w J(w_{i-1})]$$

More generally, when regularization is employed and when the step-size is also allowed to
be iteration-dependent, the recursion for Newton's method is replaced by

$$w_i = w_{i-1} + \mu(i)\,[\epsilon(i) I + R_u]^{-1}\,[R_{du} - R_u w_{i-1}]$$ (10.8)

for some positive scalars $\{\epsilon(i)\}$; they could be set to a constant value for all $i$, say, $\epsilon(i) = \epsilon$.

Now all the steepest-descent formulations described above, i.e., (10.4), (10.6), (10.7)
and (10.8), rely explicitly on knowledge of $\{R_{du}, R_u\}$. This fact constitutes a limitation in
practice and serves as a motivation for the development of stochastic-gradient algorithms
for two reasons:

1. Lack of statistical information. First, the quantities $\{R_{du}, R_u\}$ are rarely available
in practice. As a result, the true gradient vector, $\nabla_w J(w_{i-1})$, and the true Hessian
matrix, $\nabla_w^2 J(w_{i-1})$, cannot be evaluated exactly and a true steepest-descent
implementation becomes impossible. Stochastic-gradient algorithms replace the gradient
vector and the Hessian matrix by approximations for them. There are several
ways for obtaining such approximations. The better the approximation, the closer we
expect the performance of the resulting stochastic-gradient algorithm to be to that of
the original steepest-descent method. In Parts IV (Mean-Square Performance) and V
(Transient Performance) we shall study and quantify the degradation in performance
that results from such approximations.

2. Variation in the statistical information. Second, and even more important,
the quantities $\{R_{du}, R_u\}$ tend to vary with time.^ In this way, the optimal weight
vector $w^o$ will also vary with time. It turns out that stochastic-gradient algorithms
provide a mechanism for tracking such variations in the signal statistics.


We therefore move on to motivate and develop several stochastic-gradient algorithms. 


10.2 INSTANTANEOUS APPROXIMATION 


Assume that we have access to several observations of the random variables $\boldsymbol{d}$ and $\boldsymbol{u}$ in
(10.1), say, $\{d(0), d(1), d(2), d(3), \ldots\}$ and $\{u_0, u_1, u_2, u_3, \ldots\}$. We refer to the $\{u_i\}$
as regressors. Observe that, in conformity with our notation in this book, we are using
the boldface letter $\boldsymbol{d}$ to refer to the random variable, and the normal letter $d$ to refer to
observations (or realizations) of it. Likewise, we write $\boldsymbol{u}$ for the random vector and $u$ for
observations of it.

One of the simplest approximations for $\{R_{du}, R_u\}$ is to use the instantaneous values

$$\widehat{R}_{du} = d(i)u_i^*, \quad \widehat{R}_u = u_i^* u_i$$ (10.9)

This construction simply amounts to dropping the expectation operator from the actual
definitions, $R_{du} = E\,\boldsymbol{d}\boldsymbol{u}^*$ and $R_u = E\,\boldsymbol{u}^*\boldsymbol{u}$, and replacing the random variables $\{\boldsymbol{d}, \boldsymbol{u}\}$
by the observations $\{d(i), u_i\}$ at iteration $i$. In this way, the gradient vector (10.5) is
approximated by the instantaneous value

$$-[\nabla_w J(w_{i-1})]^* \approx d(i)u_i^* - u_i^* u_i w_{i-1} = u_i^*\,[d(i) - u_i w_{i-1}]$$

and the corresponding steepest-descent recursion (10.4) becomes

$$w_i = w_{i-1} + \mu u_i^*\,[d(i) - u_i w_{i-1}], \quad w_{-1} = \text{initial guess}$$ (10.10)

We continue to write $w_i$ to denote the estimate that is obtained via this approximation
procedure although, of course, $w_i$ in (10.10) is different from the $w_i$ that is obtained from
the steepest-descent algorithm (10.4): the former is based on using instantaneous
approximations whereas the latter is based on using $\{R_{du}, R_u\}$. We do so in order to avoid an
explosion of notation; the distinction between both estimates is usually clear from the
context.

The stochastic-gradient approximation (10.10) is one of the most widely used adaptive 
algorithms in current practice due to its striking simplicity. It is widely known as the least- 
mean-squares (LMS) algorithm, or sometimes as the Widrow-Hoff algorithm in honor of 
its originators. 


^We have so far interpreted the index $i$ in a steepest-descent recursion, e.g., as in (10.4), as merely an iteration
index. In the adaptive context, however, it will become common to interpret $i$ as a time index in which case $w_i$ is
interpreted as the weight estimate at time $i$.


Algorithm 10.1 (LMS Algorithm) Consider a zero-mean random variable $\boldsymbol{d}$
with realizations $\{d(0), d(1), \ldots\}$, and a zero-mean random row vector $\boldsymbol{u}$
with realizations $\{u_0, u_1, \ldots\}$. The optimal weight vector $w^o$ that solves

$$\min_w E\,|\boldsymbol{d} - \boldsymbol{u}w|^2$$

can be approximated iteratively via the recursion

$$w_i = w_{i-1} + \mu u_i^*\,[d(i) - u_i w_{i-1}], \quad i \geq 0, \quad w_{-1} = \text{initial guess}$$

where $\mu$ is a positive step-size (usually small).
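
The LMS recursion translates almost line by line into code. The following minimal sketch handles complex-valued data with plain Python lists; the function names, the zero initial guess, the test weight vector, and the noise-free data model $d(i) = u_i w^o$ are all assumptions made for illustration and are not part of the algorithm statement.

```python
import random


def lms_step(w, u, d, mu):
    """One LMS update: w_i = w_{i-1} + mu * u_i^* [d(i) - u_i w_{i-1}]."""
    e = d - sum(uk * wk for uk, wk in zip(u, w))          # a priori error d(i) - u_i w_{i-1}
    w_new = [wk + mu * complex(uk).conjugate() * e for wk, uk in zip(w, u)]
    return w_new, e


def run_lms(regressors, desired, mu):
    """Run LMS over all data pairs {u_i, d(i)}, starting from the zero vector."""
    w = [0j] * len(regressors[0])
    for u, d in zip(regressors, desired):
        w, _ = lms_step(w, u, d, mu)
    return w


rng = random.Random(0)
w_true = [1 + 0.5j, -0.3 + 0j, 0.2j]                      # hypothetical w^o
regressors = [[complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in w_true]
              for _ in range(2000)]
desired = [sum(uk * wk for uk, wk in zip(u, w_true)) for u in regressors]
w_hat = run_lms(regressors, desired, mu=0.05)

if __name__ == "__main__":
    print(w_hat)
```

Because the assumed data are noise-free, the estimate approaches the generating vector closely; with noisy measurements the weights would instead fluctuate around $w^o$ with a spread that grows with $\mu$.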


10.3 COMPUTATIONAL COST 


A useful property of LMS is its computational simplicity. The evaluations that follow for 
the number of computations that are required by LMS are intended to provide an approx- 
imate idea of its computational complexity. While there can be different ways to perform 
specific calculations, the resulting overall filter complexity will be of the same order of 
magnitude (and often quite similar) to the values we derive in this section for LMS, and in 
subsequent sections for other adaptive filters. 

Note that each step of the LMS algorithm requires a handful of straightforward computations,
which are explained below:

1. Each iteration (10.10) requires the evaluation of the inner product $u_i w_{i-1}$, between
two vectors of size $M$ each. This calculation requires $M$ complex multiplications
and $(M - 1)$ complex additions. Using the fact that one complex multiplication
requires four real multiplications and two real additions, while one complex addition
requires two real additions, we find that the evaluation of this inner product requires
$4M$ real multiplications and $4M - 2$ real additions.

2. The algorithm also requires the evaluation of the scalar $d(i) - u_i w_{i-1}$. This
calculation amounts to one complex addition, i.e., 2 real additions.

3. Evaluation of the product $\mu[d(i) - u_i w_{i-1}]$, where $\mu$ is a real scalar, requires two real
multiplications when the data is complex-valued. Usually, $\mu$ is chosen as a power
of $2^{-1}$, say, $2^{-m}$ for some integer $m > 0$. In this case, multiplying $\mu$ by
$[d(i) - u_i w_{i-1}]$ can be implemented digitally very efficiently by means of shift registers,
and we could therefore ignore the cost of this multiplication. In the text, we choose
to account for the case of arbitrary step-sizes.

4. The algorithm further requires multiplying the scalar $\mu[d(i) - u_i w_{i-1}]$ by the vector
$u_i^*$. This requires $M$ complex multiplications and, therefore, $4M$ real multiplications
and $2M$ real additions.

5. Finally, the addition of the two vectors $w_{i-1}$ and $\mu u_i^*[d(i) - u_i w_{i-1}]$ requires $M$
complex additions, i.e., $2M$ real additions.

In summary, for general complex-valued signals, LMS requires $8M + 2$ real multiplications
and $8M$ real additions per iteration. On the other hand, for real-valued data, LMS requires
$2M + 1$ real multiplications and $2M$ real additions per iteration.
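
The tally can be double-checked by adding up the per-step counts. The helper below is only a bookkeeping sketch (its name and structure are not from the text); the real-valued counts are obtained by treating every complex operation as a single real one.

```python
def lms_cost_per_iteration(M, complex_data=True):
    """Return (real multiplications, real additions) for one LMS iteration with M taps."""
    if complex_data:
        steps = [
            (4 * M, 4 * M - 2),   # step 1: inner product u_i w_{i-1}
            (0, 2),               # step 2: error d(i) - u_i w_{i-1}
            (2, 0),               # step 3: scaling of the error by the real step-size
            (4 * M, 2 * M),       # step 4: scalar times the vector u_i^*
            (0, 2 * M),           # step 5: vector addition
        ]
    else:
        steps = [(M, M - 1), (0, 1), (1, 0), (M, 0), (0, M)]
    mults = sum(m for m, _ in steps)
    adds = sum(a for _, a in steps)
    return mults, adds


if __name__ == "__main__":
    for M in (4, 10, 64):
        print(M, lms_cost_per_iteration(M, True), lms_cost_per_iteration(M, False))
```

For any $M$ the totals agree with the expressions $8M + 2$ and $8M$ for complex data, and $2M + 1$ and $2M$ for real data.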


10.4 LEAST-PERTURBATION PROPERTY 


The LMS algorithm was derived in Sec. 10.2 as an approximate iterative solution to the 
linear least-mean-squares estimation problem (10.1), in the sense that it was obtained by 
replacing the actual gradient vector in the steepest-descent implementation (10.4) by an 
instantaneous approximation for it. It turns out that LMS could have been derived in a 
different manner as the exact (not approximate) solution of a well-defined estimation prob- 
lem: albeit one with a different cost criterion. We shall pursue this derivation later in 
Chapter 45 — see Sec. 45.4, when we study robust adaptive filters. Until then, we shall 
continue to treat LMS as a stochastic-gradient algorithm and proceed to highlight some of 
its properties. 

One particular property is that LMS can also be regarded as the exact solution to a local 
(as opposed to a global) optimization problem. To see this, assume that we have available 
some weight estimate at time $i - 1$, say, $w_{i-1}$, and let $\{d(i), u_i\}$ denote the newly measured
data. Let $w_i$ denote a new weight estimate that we wish to compute from the available data
$\{d(i), u_i, w_{i-1}\}$ in the following manner.

Let $\mu$ denote a given positive number and define two estimation errors: the a priori
output estimation error

$$e(i) \triangleq d(i) - u_i w_{i-1}$$ (10.11)

and the a posteriori output estimation error

$$r(i) \triangleq d(i) - u_i w_i$$ (10.12)

The former measures the error in estimating $d(i)$ by using $u_i w_{i-1}$, i.e., by using the available
weight estimate prior to the update, while the latter measures the error in estimating
$d(i)$ by using $u_i w_i$, i.e., by using the new weight estimate. We then seek a $w_i$ that solves
the constrained optimization problem:

$$\min_{w_i} \|w_i - w_{i-1}\|^2 \quad \text{subject to} \quad r(i) = (1 - \mu\|u_i\|^2)\,e(i)$$ (10.13)

In other words, we seek a solution $w_i$ that is closest to $w_{i-1}$ in the Euclidean norm sense
and subject to an equality constraint between $r(i)$ and $e(i)$. The constraint is most relevant
when the step-size $\mu$ is small enough to satisfy $|1 - \mu\|u_i\|^2| < 1$, i.e., when

$$0 < \mu\|u_i\|^2 < 2 \quad \text{for all } i$$ (10.14)

This is because, when (10.14) holds, the magnitude of the a posteriori error, $r(i)$, will
always be less than that of the a priori error, $e(i)$, i.e., $|r(i)| < |e(i)|$ or, equivalently,
$u_i w_i$ will be a better estimate for $d(i)$ than $u_i w_{i-1}$ (except when $e(i) = 0$, in which case
$r(i) = 0$ also). One could of course impose different kinds of constraints on $r(i)$ than
(10.13); these will not lead to the LMS algorithm but to other algorithms.

The solution of (10.13) can be obtained from first principles as follows. Let $\delta w =
(w_i - w_{i-1})$. Then

$$\begin{aligned}
u_i \delta w &= u_i w_i - u_i w_{i-1} \\
&= [u_i w_i - d(i)] + [d(i) - u_i w_{i-1}] \\
&= -r(i) + e(i) \\
&= \mu\|u_i\|^2 e(i)
\end{aligned}$$

where in the last equality we used the constraint (10.13). This calculation shows that the
optimization problem (10.13) is equivalent to determining a vector $\delta w$ of smallest Euclidean
norm that satisfies $u_i \delta w = \mu\|u_i\|^2 e(i)$, i.e.,

$$\min_{\delta w} \|\delta w\|^2 \quad \text{subject to} \quad u_i \delta w = \mu\|u_i\|^2 e(i)$$ (10.15)


The constraint $u_i \delta w = \mu\|u_i\|^2 e(i)$ amounts to an under-determined linear system of
equations in $\delta w$, and it therefore admits infinitely many solutions. Among all solutions $\delta w$
we desire to determine one that has the smallest Euclidean norm. To do so, we distinguish
between two cases:

1. $\|u_i\|^2 = 0$. In this case, $u_i = 0$ and the constraint in (10.15) is satisfied by any $\delta w$.
The $\delta w$ with smallest Euclidean norm is then $\delta w = 0$, so that $w_i = w_{i-1}$.

2. $\|u_i\|^2 \neq 0$. In this case, it is easy to find at least one solution to $u_i \delta w = \mu\|u_i\|^2 e(i)$.
Indeed,

$$\delta w^o = \mu u_i^* e(i)$$ (10.16)

is one such solution since if we multiply $\delta w^o$ by $u_i$ from the left we get

$$u_i \delta w^o = \mu\|u_i\|^2 e(i)$$

However, there are infinitely many other solutions to this equation. To see this, let
$\delta w^o + z$ denote any other possible solution, say,

$$u_i(\delta w^o + z) = \mu\|u_i\|^2 e(i)$$

Now since $\delta w^o$ is a solution, we find that $z$ must satisfy $u_i z = 0$. This means that
all solutions to $u_i \delta w = \mu\|u_i\|^2 e(i)$ can be obtained from $\delta w^o$ by adding to it any
column vector $z$ that is orthogonal to $u_i^*$ (and there are infinitely many such vectors).
Still, among all solutions $\{\delta w^o + z\}$, it can be verified that the $\delta w^o$ in (10.16) has
the smallest Euclidean norm since

$$\begin{aligned}
\|\delta w^o + z\|^2 &= \|\delta w^o\|^2 + \|z\|^2 + (\delta w^o)^* z + z^* \delta w^o \\
&= \|\delta w^o\|^2 + \|z\|^2 + \mu (u_i z) e^*(i) + \mu (u_i z)^* e(i) \\
&= \|\delta w^o\|^2 + \|z\|^2 + 0 + 0 \\
&\geq \|\delta w^o\|^2
\end{aligned}$$

Finally, using $\delta w^o = w_i - w_{i-1}$ and (10.16) we arrive at the following expression
for $w_i$:

$$w_i = w_{i-1} + \mu u_i^*\,[d(i) - u_i w_{i-1}]$$

which coincides with the LMS recursion (10.10). We have therefore established that
LMS is the solution to the localized optimization problem (10.13).
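
The least-perturbation property can be verified numerically for a single update. The sketch below assumes real-valued data drawn at random (the dimension, step-size, and seed are arbitrary choices); it checks that the LMS update satisfies the constraint in (10.13) and that another feasible perturbation, built by adding a vector $z$ with $u_i z = 0$, has a larger norm.

```python
import random


def dot(a, b):
    """Inner product of two real vectors stored as lists."""
    return sum(x * y for x, y in zip(a, b))


rng = random.Random(0)
M, mu = 4, 0.1                                   # assumed filter length and step-size
u = [rng.gauss(0.0, 1.0) for _ in range(M)]      # regressor u_i (real-valued here)
w_prev = [rng.gauss(0.0, 1.0) for _ in range(M)]
d = rng.gauss(0.0, 1.0)

e = d - dot(u, w_prev)                           # a priori error (10.11)
dw_lms = [mu * uk * e for uk in u]               # delta w^o of (10.16) for real data
w_new = [a + b for a, b in zip(w_prev, dw_lms)]
r = d - dot(u, w_new)                            # a posteriori error (10.12)

# Another feasible perturbation: add a vector z that is orthogonal to u.
g = [rng.gauss(0.0, 1.0) for _ in range(M)]
scale = dot(u, g) / dot(u, u)
z = [gk - scale * uk for gk, uk in zip(g, u)]    # projection of g onto the complement of u
dw_other = [a + b for a, b in zip(dw_lms, z)]

if __name__ == "__main__":
    print("constraint gap :", r - (1.0 - mu * dot(u, u)) * e)
    print("||dw_lms||^2   :", dot(dw_lms, dw_lms))
    print("||dw_other||^2 :", dot(dw_other, dw_other))
```

Both perturbations satisfy $u_i \delta w = \mu\|u_i\|^2 e(i)$, but the squared norm of the second exceeds that of the LMS perturbation by $\|z\|^2$, as in the chain of equalities above.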


10.5 APPLICATION: ADAPTIVE CHANNEL ESTIMATION 


Before proceeding to the derivation of other similar stochastic-gradient algorithms, we find 
it instructive to describe how such algorithms are useful in the context of several applica- 
tions, including channel estimation, linear equalization and decision-feedback equaliza- 
tion. These are the same applications that we studied before in Secs. 5.2, 5.4, 6.3 and 6.4, 
except that now we are going to solve them in an iterative (i.e., adaptive) manner. We start 
with the problem of channel estimation. 


FIGURE 10.1 Noisy measurements of an FIR channel with an impulse response vector c. 


Figure 10.1 shows an FIR channel excited by a zero-mean random sequence $\{\boldsymbol{u}(i)\}$. Its
output is another zero-mean random sequence $\{\boldsymbol{d}(i)\}$. At any time instant $i$, the state of
the channel is captured by the regressor

$$\boldsymbol{u}_i = [\,\boldsymbol{u}(i) \quad \boldsymbol{u}(i-1) \quad \ldots \quad \boldsymbol{u}(i-M+1)\,]$$

and its output is given by

$$\boldsymbol{d}(i) = \boldsymbol{u}_i c + \boldsymbol{v}(i)$$ (10.17)

where the column vector $c$ denotes the channel impulse response sequence, and $\boldsymbol{v}(i)$ is a
zero-mean noise sequence that is uncorrelated with $\boldsymbol{u}_i$. We again remind the reader of our
notation for indexing vectors and scalars in this book. We use subscripts as time indices
for vectors, e.g., $\boldsymbol{u}_i$, and parentheses as time indices for scalars, e.g., $\boldsymbol{u}(i)$.

It is further assumed that the moments $R_u = E\,\boldsymbol{u}_i^*\boldsymbol{u}_i$, $\sigma_d^2 = E\,|\boldsymbol{d}(i)|^2$, and $R_{du} =
E\,\boldsymbol{d}(i)\boldsymbol{u}_i^*$ are all independent of time. We also let $\{u_i, d(i)\}$ denote observed values for the
random variables $\{\boldsymbol{u}_i, \boldsymbol{d}(i)\}$. It is important to distinguish between a measured value $d(i)$
and its stochastic version $\boldsymbol{d}(i)$; similarly for $u_i$ and $\boldsymbol{u}_i$. This distinction is relevant because
while an adaptive filtering implementation operates on the measured data $\{d(i), u_i\}$, its
derivation and performance are characterized in terms of the statistical properties of the
underlying stochastic variables $\{\boldsymbol{d}(i), \boldsymbol{u}_i\}$.

The channel vector $c$ is modeled as an unknown constant vector. This situation is identical
to the scenario discussed in Sec. 6.3, where $c$ was estimated by formulating a constrained
linear estimation problem (see (6.18)). In terms of the notation of the present
section, the rows of the data matrix $H$ from Eq. (6.18) would be the $\{u_i\}$, while the entries
of the observation vector $y$ in (6.18) would be the $\{d(i)\}$. The solution method (6.18)
did not require knowledge of $\{R_{du}, R_u\}$. There is an alternative way to estimate $c$ that
does not require knowledge of $\{R_{du}, R_u\}$ either; it relies on using a stochastic-gradient
(or adaptive) algorithm. The method can be motivated as follows. Assume we formulate
the following linear least-mean-squares estimation problem:

$$\min_w E\,|\boldsymbol{d}(i) - \boldsymbol{u}_i w|^2$$

whose solution we already know is

$$w^o = R_u^{-1} R_{du}$$ (10.18)

Then $w^o$ coincides with the desired unknown $c$. This is because if we multiply (10.17) by
$\boldsymbol{u}_i^*$ from the left and take expectations, we find that

$$E\,\boldsymbol{u}_i^* \boldsymbol{d}(i) = (E\,\boldsymbol{u}_i^* \boldsymbol{u}_i)\,c + \underbrace{E\,\boldsymbol{u}_i^* \boldsymbol{v}(i)}_{=0}$$

so that $c = R_u^{-1} R_{du} = w^o$. Therefore, if we can determine $w^o$ then we recover $c$. However,
the moments $\{R_{du}, R_u\}$ are rarely available in practice so that determining $w^o$ via
(10.18), or even via a related steepest-descent implementation such as

$$w_i = w_{i-1} + \mu\,[R_{du} - R_u w_{i-1}]$$

would not be viable. Instead, we can appeal to a stochastic-gradient approximation for
estimating $w^o$. Using measurements $\{d(i), u_i\}$, we can estimate $w^o$ (and, hence, $c$) by
using, e.g., the LMS recursion:

$$w_i = w_{i-1} + \mu u_i^*\,[d(i) - u_i w_{i-1}]$$ (10.19)


FIGURE 10.2 A structure for adaptive channel estimation.

This discussion suggests the structure shown in Fig. 10.2, which employs an FIR filter
with adjustable weights that is connected to the input and output signals $\{d(i), u(i)\}$ of
the channel. At each time instant $i$, the measured output of the channel, $d(i)$, is compared
with the output of the adaptive filter, $\hat{d}(i) = u_i w_{i-1}$, and an error signal, $e(i) = d(i) -
u_i w_{i-1}$, is generated. The error is then used to adjust the filter coefficients from $w_{i-1}$ to
$w_i$ according to (10.19). In steady-state, if the step-size $\mu$ is suitably chosen to guarantee
filter convergence (usually, small step-sizes will do), then the error signal will assume small
values, and the output $\hat{d}(i) = u_i w_{i-1}$ of the adaptive filter will assume values close to $d(i)$.
Consequently, from an input/output perspective, the (adaptive filter) mapping from $u(i)$ to
$\hat{d}(i)$ will behave similarly to the channel, which maps $u(i)$ to $d(i)$. This construction
assumes that the adaptive filter has at least as many taps as the channel itself; otherwise,
performance degradation can occur due to under-modeling.
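
The channel-estimation structure can be simulated in a few lines. The sketch below assumes a hypothetical real-valued three-tap channel, a white Gaussian input, and a small amount of measurement noise; the function name and all numerical values are illustrative choices rather than part of the discussion above.

```python
import random


def estimate_channel(c_true, n_samples=5000, mu=0.01, noise_std=0.01, seed=0):
    """Identify an FIR channel with LMS (10.19) from simulated data {d(i), u_i}."""
    rng = random.Random(seed)
    M = len(c_true)                 # the adaptive filter has as many taps as the channel
    w = [0.0] * M                   # initial guess w_{-1} = 0
    reg = [0.0] * M                 # regressor u_i = [u(i), u(i-1), ..., u(i-M+1)]
    for _ in range(n_samples):
        reg = [rng.gauss(0.0, 1.0)] + reg[:-1]                      # shift in the new input sample
        d = sum(uk * ck for uk, ck in zip(reg, c_true)) + rng.gauss(0.0, noise_std)
        e = d - sum(uk * wk for uk, wk in zip(reg, w))              # error d(i) - u_i w_{i-1}
        w = [wk + mu * uk * e for wk, uk in zip(w, reg)]            # LMS update (real data)
    return w


if __name__ == "__main__":
    channel = [1.0, 0.5, -0.2]      # hypothetical channel impulse response c
    print(estimate_channel(channel))
```

With a small step-size the final weights sit close to the assumed channel taps; shortening the adaptive filter below the channel length would leave a residual error due to under-modeling.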


10.6 APPLICATION: ADAPTIVE CHANNEL EQUALIZATION 


Our second application is linear channel equalization. In Fig. 10.3, data symbols $\{\boldsymbol{s}(\cdot)\}$
are transmitted through a channel and the output sequence is measured in the presence of
additive noise, $\boldsymbol{v}(i)$. The signals $\{\boldsymbol{v}(\cdot), \boldsymbol{s}(\cdot)\}$ are assumed uncorrelated. The noisy output
of the channel is denoted by $\boldsymbol{u}(i)$ and is fed into a linear equalizer with $L$ taps. At any
particular time instant $i$, the state of the equalizer is given by

$$\boldsymbol{u}_i = [\,\boldsymbol{u}(i) \quad \boldsymbol{u}(i-1) \quad \ldots \quad \boldsymbol{u}(i-L+1)\,]$$

It is desired to determine the equalizer tap vector $w$ in order to estimate the signal $\boldsymbol{d}(i) =
\boldsymbol{s}(i - \Delta)$ optimally in the least-mean-squares sense. This application coincides with the
one described in Sec. 5.4 except that now, in conformity with the notation of this chapter,
we are denoting the input to the equalizer by $\boldsymbol{u}(i)$ and the symbol to be estimated by $\boldsymbol{d}(i)$.
Clearly, the equalizer $w^o$ that solves

$$\min_w E\,|\boldsymbol{d}(i) - \boldsymbol{u}_i w|^2$$

is given by

$$w^o = R_u^{-1} R_{du}$$ (10.20)

where $R_u = E\,\boldsymbol{u}_i^*\boldsymbol{u}_i$ and $R_{du} = E\,\boldsymbol{d}(i)\boldsymbol{u}_i^*$. In Sec. 5.4 we assumed knowledge of the
channel tap vector $c$ (assumed FIR) and used it to evaluate $\{R_{du}, R_u\}$; refer to equations
(5.15), which were defined in terms of a channel matrix $H$. However, in practice, knowledge
of $\{R_{du}, R_u\}$ and even $c$ cannot be taken for granted. Actually, more often than not,
these quantities are not available. For this reason, determining $w^o$ via (10.20), or even via
a related steepest-descent implementation such as

$$w_i = w_{i-1} + \mu\,[R_{du} - R_u w_{i-1}]$$

may not be viable. In such situations, we can appeal to a stochastic-gradient approximation
for estimating $w^o$. Assuming an initial training phase in which transmitted data $\{d(i) =
s(i - \Delta)\}$ are known at the receiver (i.e., at the equalizer), we can then use the available
measurements $\{d(i), u_i\}$ to estimate $w^o$ iteratively by using, e.g., the LMS recursion:

$$w_i = w_{i-1} + \mu u_i^*\,[d(i) - u_i w_{i-1}], \quad d(i) = s(i - \Delta)$$ (10.21)


v(i) 


Sou cem 


Linear equalizer 


FIGURE 10.3 Linear equalization of an FIR channel in the presence of additive noise. 
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FIGURE 10.4 Adaptive linear equalization of a channel in the presence of additive noise. 


This discussion suggests that we consider the structure of Fig. 10.4, with an FIR filter
with adjustable weights that is connected in series with the channel. At each time instant
$i$, the symbol $d(i) = s(i-\Delta)$ is compared with the output of the adaptive filter,
$\hat{s}(i-\Delta)$, and an error signal, $e(i) = d(i) - u_i w_{i-1}$, is generated. The error is
then used to adjust the filter coefficients from $w_{i-1}$ to $w_i$ according to (10.21). In
steady-state, the error signal will assume small values and, hence, the output
$\hat{s}(i-\Delta)$ of the adaptive filter will assume values close to $s(i-\Delta)$. Consequently,
from an input/output perspective, the combination channel/equalizer, which maps $s(i)$ to
$\hat{s}(i-\Delta)$, behaves "like" a delay system with transfer function $z^{-\Delta}$. Observe
that this scheme for adaptive channel equalization does not require knowledge of the channel.

In practice, following a training phase with a known reference sequence $\{d(i)\}$, an
equalizer could continue to operate in one of two modes. In the first mode, its coefficient
vector $w_i$ would be frozen and used thereafter to generate future outputs
$\{\hat{s}(i-\Delta)\}$. This mode of operation is appropriate when the training phase is
successful enough to result in reliable estimates $\hat{s}(i-\Delta)$, namely, estimates that
lead to a low probability of error after they are fed into a decision device, which maps
$\hat{s}(i-\Delta)$ to the closest point in the symbol constellation, say, $\check{s}(i-\Delta)$.
However, if the channel varies slowly with time, it may be necessary to continue to operate
the equalizer in a decision-directed mode. In this mode of operation, the weight vector of
the equalizer continues to be adapted even after the training phase has ended. Now, however,
the output of the decision device replaces the reference sequence $\{d(i)\}$ in the generation
of the error sequence $\{e(i)\}$. Figure 10.5 illustrates the operation of an equalizer during
the training and decision-directed modes of operation.
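The two modes of operation can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the symbols are real BPSK values ($\pm 1$), the channel taps `[1.0, 0.4]`, the noise level, the delay, the filter length, and the step-size are hypothetical choices that do not come from the text, and the function name `lms_equalizer` is introduced here for illustration.

```python
import random

def lms_equalizer(u, s, delay, num_taps, mu, num_training):
    """LMS linear equalizer for real BPSK symbols (+1/-1).

    u: received samples; s: transmitted symbols (consulted only while training);
    delay: the decision delay Delta. Returns (weights, decisions), where
    decisions[i] is the decision about s[i - delay] (None before it exists).
    """
    w = [0.0] * num_taps
    decisions = [None] * len(u)
    for i in range(len(u)):
        # Regressor u_i = [u(i), u(i-1), ..., u(i-M+1)], zero before time 0.
        reg = [u[i - k] if i - k >= 0 else 0.0 for k in range(num_taps)]
        s_hat = sum(r * wk for r, wk in zip(reg, w))   # equalizer output
        s_check = 1.0 if s_hat >= 0 else -1.0          # decision device
        if i - delay >= 0:
            decisions[i] = s_check
        if i < num_training:
            if i - delay < 0:
                continue                               # no reference symbol yet
            d = s[i - delay]                           # training mode
        else:
            d = s_check                                # decision-directed mode
        e = d - s_hat                                  # a priori error e(i)
        w = [wk + mu * r * e for wk, r in zip(w, reg)]
    return w, decisions

random.seed(0)
channel = [1.0, 0.4]   # hypothetical FIR channel, not taken from the text
n = 3000
s = [random.choice([-1.0, 1.0]) for _ in range(n)]
u = [sum(channel[k] * s[i - k] for k in range(len(channel)) if i - k >= 0)
     + random.gauss(0.0, 0.05) for i in range(n)]
w, decisions = lms_equalizer(u, s, delay=2, num_taps=7, mu=0.01, num_training=1000)
errors = sum(1 for i in range(2000, n) if decisions[i] != s[i - 2])
print("symbol errors over the last 1000 symbols:", errors)
```

After the first 1000 samples the function never reads `s` again; the adaptation is driven entirely by the output of the decision device, which is the decision-directed mode described above.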


10.7 APPLICATION: DECISION-FEEDBACK EQUALIZATION 


Our third application is decision-feedback equalization. Again, data symbols $\{s(\cdot)\}$ are
transmitted through a channel and the output sequence is measured in the presence of additive
noise, $v(i)$. The signals $\{v(\cdot), s(\cdot)\}$ are assumed uncorrelated. The noisy output of
the channel is denoted by $u(i)$ and is fed into a feedforward filter with $L$ taps, as indicated
in Fig. 10.6 for the case $L = 3$. The output of the decision device is fed into a feedback
filter with $Q$ taps; this filter works in conjunction with the feedforward filter in order to
supply the decision device with an estimator $\hat{s}(i-\Delta)$. Assuming correct decisions, at
any particular time instant $i$, the state of the equalizer is given by

$$u_i = \big[\; s(i-\Delta-1) \;\; \ldots \;\; s(i-\Delta-Q) \;\;\big|\;\; u(i) \;\; \ldots \;\; u(i-L+1) \;\big]$$


That is, it contains the states of both the feedback and the feedforward filter. It is then
desired to determine an equalizer tap vector

$$w \triangleq \mathrm{col}\{-b(1), -b(2), \ldots, -b(Q), f(0), f(1), \ldots, f(L-1)\}$$

that estimates $d(i) = s(i-\Delta)$ optimally in the least-mean-squares sense, where the
$\{-b(i)\}$ denote the coefficients of the feedback filter while the $\{f(i)\}$ denote the
coefficients of the forward filter. This application coincides with the one described in
Sec. 6.4, except that now, in conformity with the notation of this chapter, we are denoting
the input to the equalizer by $u(i)$ and the symbol to be estimated by $d(i)$. We are also
relying on the formulation of the decision-feedback equalizer as a linear estimation problem
(cf. Eq. (6.35)).
Clearly, the equalizer $w^o$ that solves $\min_w E\,|d(i) - u_i w|^2$ is

$$w^o = R_u^{-1} R_{du} \qquad (10.22)$$

where $R_u = E\,u_i^* u_i$ and $R_{du} = E\,d(i) u_i^*$. In Sec. 6.4, and also Prob. I.41, we assumed
knowledge of the channel tap vector $c$ (assumed FIR) and used it to evaluate $\{R_{du}, R_u\}$;
refer to equations (6.33) and (6.34), which were defined in terms of a channel matrix $H$.
However, in practice, knowledge of $\{R_{du}, R_u, c\}$ is generally unavailable. For this reason,
determining $w^o$ via (10.22), or even via a related steepest-descent implementation such as

$$w_i = w_{i-1} + \mu\,[R_{du} - R_u w_{i-1}]$$


would not be viable. Instead, we can appeal to a stochastic-gradient approximation for
estimating $w^o$. Assuming an initial training phase in which transmitted data
$\{d(i) = s(i-\Delta)\}$ are known at the receiver (i.e., at the equalizer), we can then use the
available measurements $\{d(i), u_i\}$ to estimate $w^o$ iteratively by using, e.g., the LMS
recursion:

$$w_i = w_{i-1} + \mu\, u_i^* [d(i) - u_i w_{i-1}], \qquad d(i) = s(i-\Delta) \qquad (10.23)$$

FIGURE 10.5 An adaptive linear equalizer operating in two modes: training mode and
decision-directed mode.

FIGURE 10.6 A decision-feedback equalizer. It consists of a feedforward filter, a feedback filter,
and a decision device.


Thus, consider the structure shown in Fig. 10.7, with FIR filters with adjustable weights
used as feedforward and feedback filters. The top entries of $w_i$ would correspond to the
coefficients of the feedback filter, while the bottom entries of $w_i$ would correspond to
the coefficients of the feedforward filter. Likewise, the leading entries of the regressor
correspond to the state of the feedback filter, while the trailing entries of the regressor
correspond to the state of the feedforward filter.

Figure 10.7 depicts two modes of operation: training and decision-directed. During
training, delayed symbols are used as a reference sequence. At each time instant $i$, the
symbol $d(i) = s(i-\Delta)$ is compared with the output of the adaptive filter,
$\hat{s}(i-\Delta)$ (which is the input to the decision device), and an error signal,
$e(i) = d(i) - u_i w_{i-1}$, is generated. The error is then used to adjust the coefficients of
the feedback and feedforward filters according to (10.23). During training, the state (or
regressor) of the equalizer is given by

$$u_i = \big[\; s(i-\Delta-1) \;\; \ldots \;\; s(i-\Delta-Q) \;\;\big|\;\; u(i) \;\; \ldots \;\; u(i-L+1) \;\big]$$

while during decision-directed operation, the signal $d(i)$ is replaced by the output of the
decision device, $\check{s}(i-\Delta)$, so that the state of the equalizer is then given by

$$u_i = \big[\; \check{s}(i-\Delta-1) \;\; \ldots \;\; \check{s}(i-\Delta-Q) \;\;\big|\;\; u(i) \;\; \ldots \;\; u(i-L+1) \;\big]$$


10.8 ENSEMBLE-AVERAGE LEARNING CURVES 


It is often necessary to evaluate the performance of a stochastic-gradient algorithm. A 
common way to do so is to construct its ensemble-average learning curve, which is defined 
below. 


FIGURE 10.7 Adaptive decision-feedback equalization. Both modes of operation are shown:
training and decision-directed operation.


Recall that for least-mean-squares estimation, the learning curve of a steepest-descent
method was defined in Sec. 9.5 as (cf. (9.1)):

$$J(i) \triangleq E\,|d - u w_{i-1}|^2$$

where $w_{i-1}$ is the weight estimate at iteration $i-1$ that is given by the steepest-descent
algorithm. Evaluation of $J(i)$ would require knowledge of $\{\sigma_d^2, R_{du}, R_u\}$. However,
in a stochastic-gradient implementation, we do not have access to this statistical information
but only to observations of the random variables $d$ and $u$, namely $\{d(i), u_i\}$. If we replace
$d$ by $d(i)$ and $u$ by $u_i$ in the above expression for $J(i)$, then the difference
$d - u w_{i-1}$ becomes $d(i) - u_i w_{i-1}$, with the $w_{i-1}$ now denoting the weight estimate
that is obtained from the stochastic-gradient implementation (e.g., LMS). We have denoted the
difference $d(i) - u_i w_{i-1}$ by $e(i)$ earlier in this chapter and called it the a priori
output estimation error,

$$e(i) = d(i) - u_i w_{i-1}$$


We can then estimate the learning curve of an adaptive filter as follows. We run the
algorithm for a certain number of iterations, say, for $0 \le i \le N$. The duration $N$ is
usually chosen large enough so that convergence is observed. We then compute the error sequence
$\{e(i)\}$ and the corresponding squared-error curve $\{|e(i)|^2,\; 0 \le i \le N\}$. We denote
this squared-error curve by

$$\{\,|e^{(1)}(i)|^2, \;\; 0 \le i \le N\,\}$$

with the superscript $^{(1)}$ used to indicate that this curve is the result of the first
experiment. We then run the same stochastic-gradient algorithm a second time, starting from the
same initial condition $w_{-1}$ and using data with the same statistical properties as in the
first run. After $L$ such experiments, we obtain $L$ squared-error curves,

$$\{\,|e^{(1)}(i)|^2, \; |e^{(2)}(i)|^2, \; \ldots, \; |e^{(L)}(i)|^2\,\}, \qquad 0 \le i \le N$$

The ensemble-average learning curve, over the interval $0 \le i \le N$, is defined as the
average over the $L$ experiments:

$$\hat{J}(i) \triangleq \frac{1}{L} \sum_{j=1}^{L} |e^{(j)}(i)|^2, \qquad 0 \le i \le N$$

The averaged curve $\hat{J}(i)$ so defined is a sample-average approximation of the true
learning curve $J(i)$.


Example 10.1 (Learning curves) 


We illustrate the construction of learning curves by considering an example in the context of channel
estimation, as described in Sec. 10.5. The impulse response sequence of the channel is chosen as
$c = \mathrm{col}\{1, 0.5, -1, 2\}$, i.e., its transfer function is

$$C(z) = 1 + 0.5z^{-1} - z^{-2} + 2z^{-3}$$

The channel impulse response, along with its magnitude frequency response, are shown in Fig. 10.8.


[Figure: left panel, impulse response versus tap index; right panel, frequency response versus
$\omega$ (rad/sample).]

FIGURE 10.8 The impulse-response sequence and the magnitude-frequency response of the
channel $C(z) = 1 + 0.5z^{-1} - z^{-2} + 2z^{-3}$.


White input data $\{u(i)\}$ of unit variance is fed into the channel and the output sequence is
observed in the presence of white additive noise of variance 0.01. A total of $N = 600$ samples
$\{u(i), d(i)\}$ are generated and used to train an adaptive filter with $M = 4$ taps. The filter is
trained by using the LMS algorithm of this chapter with step-size $\mu = 0.01$, as well as two other
algorithms derived in Chapters 11 and 14 for comparison purposes, namely, the so-called
$\epsilon$-NLMS algorithm with step-size $\mu = 0.2$ and $\epsilon = 0.001$, and the RLS algorithm
with $\lambda = 0.995$ and the same value of $\epsilon$. All filters start from the same initial
condition $w_{-1} = 0$.

Figure 10.9 shows two typical instantaneous squared-error curves for each of the algorithms over
the first 200 iterations, i.e., the rows show plots of $|e(i)|^2$ versus time in two random
simulations for each algorithm. Observe how the curves die out quicker for $\epsilon$-NLMS and RLS
relative to LMS. By averaging $L = 300$ such curves for each algorithm, we obtain the
ensemble-average learning curves shown in Fig. 10.10. These curves illustrate the fact that the
convergence speed increases as we move from LMS to $\epsilon$-NLMS to RLS.
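The construction of the ensemble-average curve can be sketched as follows for the LMS part of the example. This is a minimal sketch rather than the code behind Figs. 10.9 and 10.10: it uses the channel, step-size, noise variance, and run counts quoted in the example, but it assumes real Gaussian input data, covers LMS only, and the function names are introduced here for illustration.

```python
import random

def lms_squared_errors(c, mu, n, noise_std, rng):
    """One LMS run on a channel-estimation problem; returns the curve |e(i)|^2."""
    m = len(c)
    w = [0.0] * m            # initial condition w_{-1} = 0
    reg = [0.0] * m          # regressor u_i with shift structure
    curve = []
    for _ in range(n):
        reg = [rng.gauss(0.0, 1.0)] + reg[:-1]        # white, unit-variance input
        d = sum(r * ck for r, ck in zip(reg, c)) + rng.gauss(0.0, noise_std)
        e = d - sum(r * wk for r, wk in zip(reg, w))  # a priori error e(i)
        curve.append(e * e)
        w = [wk + mu * r * e for wk, r in zip(w, reg)]
    return curve

def ensemble_average(curves):
    """Average the squared-error curves of several experiments, sample by sample."""
    runs = len(curves)
    return [sum(column) / runs for column in zip(*curves)]

rng = random.Random(1)
c = [1.0, 0.5, -1.0, 2.0]                             # channel of Example 10.1
curves = [lms_squared_errors(c, mu=0.01, n=600, noise_std=0.1, rng=rng)
          for _ in range(300)]                        # L = 300 experiments
j_hat = ensemble_average(curves)
print("J_hat(10) = %.3f, J_hat(599) = %.4f" % (j_hat[10], j_hat[-1]))
```

Each run draws fresh data with the same statistics and restarts from the same initial condition, as the definition of the ensemble-average learning curve requires.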


FIGURE 10.9 Typical squared-error curves for LMS, $\epsilon$-NLMS, and RLS (two typical
realizations for each algorithm).

FIGURE 10.10 Ensemble-average learning curves for LMS, $\epsilon$-NLMS, and RLS obtained by
averaging over 300 experiments.


Normalized LMS Algorithm


Now that we have illustrated the application of adaptive filters (and specifically LMS) in
several contexts, we proceed to the derivation of other similar stochastic-gradient methods.
In this chapter we motivate the so-called Normalized LMS algorithm.


11.1 INSTANTANEOUS APPROXIMATION 


Assume again that we have access to several observations of the random variables $d$ and $u$
in (10.1), say, $\{d(0), d(1), d(2), d(3), \ldots\}$ and $\{u_0, u_1, u_2, u_3, \ldots\}$. The
normalized LMS algorithm can be motivated in much the same way as LMS except that now we start
from the regularized Newton's recursion (10.8) and assume that the regularization sequence
$\{\epsilon(i)\}$ and the step-size sequence $\{\mu(i)\}$ are constants, say, $\epsilon(i) = \epsilon$
and $\mu(i) = \mu$. Thus, using

$$w_i = w_{i-1} + \mu\,[\epsilon I + R_u]^{-1} [R_{du} - R_u w_{i-1}] \qquad (11.1)$$

and replacing the quantities $(\epsilon I + R_u)$ and $(R_{du} - R_u w_{i-1})$ by the instantaneous
approximations $(\epsilon I + u_i^* u_i)$ and $u_i^* [d(i) - u_i w_{i-1}]$, respectively, we arrive
at the stochastic-gradient recursion

$$w_i = w_{i-1} + \mu\,[\epsilon I + u_i^* u_i]^{-1} u_i^* [d(i) - u_i w_{i-1}] \qquad (11.2)$$


This recursion, in its current form, requires the inversion of the matrix
$(\epsilon I + u_i^* u_i)$ at each iteration. This step can be avoided by reworking the recursion
into an equivalent simpler form. Thus, note that $(\epsilon I + u_i^* u_i)$ is a rank-one
modification of a multiple of the identity matrix, and the inverse of every such matrix has a
similar structure. To see this, we simply apply the matrix inversion formula (5.4) to get

$$[\epsilon I + u_i^* u_i]^{-1} = \epsilon^{-1} I - \frac{\epsilon^{-2}}{1 + \epsilon^{-1}\|u_i\|^2}\, u_i^* u_i \qquad (11.3)$$

where the expression on the right-hand side is a rank-one modification of $\epsilon^{-1} I$. If we
now multiply both sides of (11.3) by $u_i^*$ from the right we obtain

$$[\epsilon I + u_i^* u_i]^{-1} u_i^* = \epsilon^{-1} u_i^* - \frac{\epsilon^{-2}}{1 + \epsilon^{-1}\|u_i\|^2}\, u_i^* \|u_i\|^2 = \frac{u_i^*}{\epsilon + \|u_i\|^2}$$

which is a scalar multiple of $u_i^*$. Substituting into (11.2) we arrive at the $\epsilon$-NLMS
recursion:

$$w_i = w_{i-1} + \frac{\mu}{\epsilon + \|u_i\|^2}\, u_i^* [d(i) - u_i w_{i-1}], \qquad i \ge 0 \qquad (11.4)$$

We refer to (11.4) as $\epsilon$-NLMS as opposed to simply NLMS, in order to highlight the
presence of the small positive number $\epsilon$; the terminology NLMS will be reserved for
$\epsilon = 0$.
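The identity used to eliminate the matrix inverse, $[\epsilon I + u_i^* u_i]^{-1} u_i^* = u_i^*/(\epsilon + \|u_i\|^2)$, is easy to check numerically. The sketch below does so without forming any inverse: it multiplies the claimed result by $(\epsilon I + u_i^* u_i)$ and compares with $u_i^*$. The helper name and the test vector are illustrative assumptions.

```python
def check_rank_one_inverse(u, eps):
    """Largest entry of (eps*I + u^* u) g - u^*, where g = u^* / (eps + ||u||^2)."""
    m = len(u)
    u_star = [complex(x).conjugate() for x in u]       # column vector u^*
    norm_sq = sum(abs(x) ** 2 for x in u)
    g = [x / (eps + norm_sq) for x in u_star]          # claimed value of the product
    residual = 0.0
    for r in range(m):
        # Row r of (eps*I + u^* u) applied to g.
        row_times_g = eps * g[r] + u_star[r] * sum(u[c] * g[c] for c in range(m))
        residual = max(residual, abs(row_times_g - u_star[r]))
    return residual

u = [1 + 2j, -0.5j, 3.0 + 0j]                          # arbitrary row regressor
print(check_rank_one_inverse(u, eps=0.001))            # at the level of rounding error
```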


Algorithm 11.1 ($\epsilon$-NLMS algorithm) Consider a zero-mean random variable $d$
with realizations $\{d(0), d(1), \ldots\}$, and a zero-mean random row vector $u$ with
realizations $\{u_0, u_1, \ldots\}$. The optimal weight vector $w^o$ that solves

$$\min_w \; E\,|d - uw|^2$$

can be approximated iteratively via the recursion

$$w_i = w_{i-1} + \frac{\mu}{\epsilon + \|u_i\|^2}\, u_i^* [d(i) - u_i w_{i-1}], \qquad i \ge 0, \qquad w_{-1} = \text{initial guess}$$

where $\mu$ is a positive step-size and $\epsilon$ is a small positive parameter.
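A direct transcription of Algorithm 11.1 might look as follows. This is a sketch under stated assumptions: the regressors are supplied as lists of complex numbers (row vectors), the data are noise-free, and the "true" weight vector, the number of samples, and the parameter values are hypothetical choices made only to exercise the recursion.

```python
import random

def eps_nlms(regressors, d, mu, eps, w_init):
    """Run eps-NLMS; regressors[i] is the row vector u_i given as a list."""
    w = list(w_init)
    errors = []
    for u_i, d_i in zip(regressors, d):
        e = d_i - sum(x * wk for x, wk in zip(u_i, w))        # a priori error e(i)
        step = mu / (eps + sum(abs(x) ** 2 for x in u_i))     # mu / (eps + ||u_i||^2)
        w = [wk + step * complex(x).conjugate() * e for wk, x in zip(w, u_i)]
        errors.append(e)
    return w, errors

rng = random.Random(0)
w_true = [1 - 1j, 0.5j, -2 + 0j]     # hypothetical system to identify
regs = [[complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in w_true] for _ in range(500)]
d = [sum(x * wk for x, wk in zip(u_i, w_true)) for u_i in regs]   # noise-free data
w, errors = eps_nlms(regs, d, mu=0.5, eps=0.001, w_init=[0j] * 3)
print(max(abs(a - b) for a, b in zip(w, w_true)))
```

Note that no matrix is inverted anywhere; the only extra work relative to LMS is the squared norm in the denominator.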


Comparing $\epsilon$-NLMS with the LMS recursion (10.10), we find that the update direction
in LMS is a scaled version of the regression vector $u_i^*$, namely, $\mu u_i^* e(i)$. The "size"
of the change to $w_{i-1}$ will therefore be in proportion to the norm of $u_i^*$. In this way, a
vector $u_i$ with a large norm will generally lead to a more substantial change to $w_{i-1}$ than a
vector $u_i$ with a small norm. Such behavior can have an adverse effect on the performance of LMS
in some applications, e.g., when dealing with speech signals, where intervals of speech
activity are usually followed by intervals of pause so that the norm of the regression vector
can fluctuate appreciably. In comparison, the correction term that is added to $w_{i-1}$ in the
$\epsilon$-NLMS recursion (11.4) is normalized with respect to the squared-norm of the regressor
$u_i$; hence the name normalized LMS. Moreover, the positive constant $\epsilon$ in the denominator
avoids division by zero or by a small number when the regressor is zero or close to zero.

In addition, as can be seen from (11.4), $\epsilon$-NLMS employs a time-variant step-size of
the form $\mu(i) = \mu/(\epsilon + \|u_i\|^2)$, as opposed to the constant step-size, $\mu$, which is
used by LMS. Now since $\epsilon$-NLMS was obtained as a stochastic-gradient approximation to
Newton's method, and given the superior convergence speed of Newton's recursion (10.8) as
compared to the standard steepest-descent recursion (10.4), we expect $\epsilon$-NLMS to exhibit a
faster convergence behavior than LMS. This is indeed the case, and this intuitive argument
will be formalized later in Chapters 24 and 25 when we study the transient behavior of
LMS and $\epsilon$-NLMS.


11.2 COMPUTATIONAL COST 


Compared to LMS, the recursion for $\epsilon$-NLMS requires the evaluation of two additional
quantities:

1. One inner product calculation is needed to evaluate $\|u_i\|^2$. This calculation is a
special inner product since it involves a vector and its conjugate transpose. It can
therefore be evaluated more efficiently than a generic inner product. To see this, let
us denote an arbitrary entry of $u_i$ by $a + jb$; then evaluation of $\|u_i\|^2$ requires $M - 1$
real additions of $M$ terms of the form $|a|^2 + |b|^2$. Each such term requires 2 real
multiplications and one real addition. Therefore, the computation of $\|u_i\|^2$ requires
$2M$ real multiplications and $2M - 1$ real additions.

2. One real addition and one real division are needed in order to evaluate
$\mu/(\epsilon + \|u_i\|^2)$.

Therefore, for general complex-valued data, each iteration of $\epsilon$-NLMS requires $10M + 2$
real multiplications, $10M$ real additions, and one real division. On the other hand, for
real-valued data, each iteration of $\epsilon$-NLMS requires $3M + 1$ real multiplications, $3M$ real
additions, and one real division.


Regressors with Shift-Structure 
When the regressor $u_i$ possesses shift-structure, the computational cost of $\epsilon$-NLMS can in
principle be reduced to a figure that is comparable to that of LMS. So assume

$$u_i = \big[\; u(i) \;\; \boxed{u(i-1) \;\; u(i-2) \;\; \ldots \;\; u(i-M+1)} \;\big] \qquad (11.5)$$

for some sequence of values $\{u(i)\}$. Then two successive regressors, $u_i$ and $u_{i-1}$, will
only differ in two entries since

$$u_{i-1} = \big[\; \boxed{u(i-1) \;\; u(i-2) \;\; \ldots \;\; u(i-M+1)} \;\; u(i-M) \;\big] \qquad (11.6)$$

The boxed entries in $u_i$ and $u_{i-1}$ are common to both regressors. It follows that

$$\|u_i\|^2 = \|u_{i-1}\|^2 - |u(i-M)|^2 + |u(i)|^2 \qquad (11.7)$$

so that the update of the successive squared-norms of the regressors can be accomplished
as follows. For general complex-valued data, it requires 2 real multiplications and 1 real
addition to evaluate each of $|u(i)|^2$ and $|u(i-M)|^2$, and 2 real additions to evaluate the sum
$\|u_{i-1}\|^2 - |u(i-M)|^2 + |u(i)|^2$. We therefore find that for general complex-valued
data and for regressors with shift-structure, each iteration of $\epsilon$-NLMS requires $8M + 6$
real multiplications, $8M + 5$ real additions, and one real division. On the other hand, for
real-valued data and for regressors with shift-structure, each iteration of $\epsilon$-NLMS requires
$2M + 3$ real multiplications, $2M + 3$ real additions, and one real division. We may remark
that, in practice, use of (11.7) is problematic due to the accumulation of rounding errors,
which can ultimately destroy the nonnegativity of $\|u_i\|^2$.
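The recursive update (11.7) can be compared against a direct evaluation of the squared norm with a short sketch such as the one below. The function names, the filter length $M = 8$, and the random input are illustrative assumptions; the regressor is taken to be zero before time 0. In double precision and over a few thousand samples the two agree to rounding error, although the drift mentioned above can become significant in low-precision or very long-running implementations.

```python
import random

def sliding_norms(u, m):
    """Squared norms ||u_i||^2 of shift-structured regressors, updated via (11.7)."""
    norms = []
    norm_sq = 0.0                                              # ||u_{-1}||^2 = 0
    for i in range(len(u)):
        leaving = abs(u[i - m]) ** 2 if i - m >= 0 else 0.0    # |u(i-M)|^2
        norm_sq = norm_sq - leaving + abs(u[i]) ** 2
        norms.append(norm_sq)
    return norms

def direct_norms(u, m):
    """The same quantity recomputed from scratch at every time instant."""
    return [sum(abs(u[i - k]) ** 2 for k in range(m) if i - k >= 0)
            for i in range(len(u))]

rng = random.Random(3)
u = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(2000)]
fast, slow = sliding_norms(u, 8), direct_norms(u, 8)
print(max(abs(a - b) for a, b in zip(fast, slow)))
```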


11.3 POWER NORMALIZATION 


Observe further that the $\epsilon$-NLMS recursion (11.4) can be written as

$$w_i = w_{i-1} + \frac{\mu/M}{\epsilon/M + \|u_i\|^2/M}\, u_i^* [d(i) - u_i w_{i-1}]$$

where we scaled both the numerator and the denominator of the term
$\mu/(\epsilon + \|u_i\|^2)$ by $1/M$ or, more compactly, as

$$w_i = w_{i-1} + \frac{\mu'}{\epsilon' + \|u_i\|^2/M}\, u_i^* [d(i) - u_i w_{i-1}] \qquad (11.8)$$

for some smaller step-size $\mu'$ and regularization parameter $\epsilon'$. This rewriting of
$\epsilon$-NLMS suggests a method for approximating $\|u_i\|^2$. The method replaces
$\|u_i\|^2/M$ by a variable $p(i)$ that is updated as follows:

$$p(i) = \beta\, p(i-1) + (1-\beta)\,|u(i)|^2, \qquad p(-1) = 0 \qquad (11.9)$$

where $\beta$ is a positive scalar chosen from within the interval $0 < \beta < 1$. In this way,
the $\epsilon$-NLMS recursion (11.8) would be replaced by

$$w_i = w_{i-1} + \frac{\mu}{\epsilon + p(i)}\, u_i^* [d(i) - u_i w_{i-1}], \qquad i \ge 0 \qquad (11.10)$$

where we are restoring the notation $\{\mu, \epsilon\}$ for the step-size and the regularization
parameter in order to avoid an explosion of notation; but it is understood that the step-size in
the implementation (11.10) is approximately $M$ times smaller than the corresponding step-size
in the $\epsilon$-NLMS implementation (11.4). We refer to (11.10) as $\epsilon$-NLMS with power
normalization since $p(i)$ can be interpreted as an estimate for the power of the input sequence
$\{u(j)\}$ (see Prob. III.24).

The inclusion of the parameter $\beta$ in (11.9) is meant to introduce memory into the
recursion for $p(i)$. Indeed, note from (11.9) that

$$p(i) = (1-\beta) \sum_{j=0}^{i} \beta^{\,i-j}\, |u(j)|^2$$

so that input data $\{u(j)\}$ in the remote past are weighted less heavily than most recent
data. Now since, for complex data, the evaluation of $p(i)$ requires four real multiplications
and three real additions, we find that for general complex-valued data, each iteration
of recursion (11.10) requires $8M + 6$ real multiplications, $8M + 4$ real additions, and one
real division. On the other hand, for real-valued data, each iteration of recursion (11.10)
requires $2M + 4$ real multiplications, $2M + 3$ real additions, and one real division. Usually,
$\beta$ is chosen as a power of $2^{-1}$ so that multiplications by $\beta$ and $(1-\beta)$ could be
implemented digitally by means of shift registers. Therefore, in principle, we can ignore the cost
of these two required multiplications. In the text, however, we choose to account for more
general choices of $\beta$.


Algorithm 11.2 ($\epsilon$-NLMS algorithm with power normalization) Consider the
same setting of Alg. 11.1. The optimal weight vector $w^o$ can be approximated
iteratively via the recursions

$$p(i) = \beta\, p(i-1) + (1-\beta)\,|u(i)|^2, \qquad p(-1) = 0$$

$$w_i = w_{i-1} + \frac{\mu}{\epsilon + p(i)}\, u_i^* [d(i) - u_i w_{i-1}], \qquad i \ge 0$$

where $0 < \beta < 1$ (usually close to one), $\mu$ is a positive step-size, and the
regression vector $u_i$ is assumed to possess shift-structure as in (11.5).
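Algorithm 11.2 can be sketched as below for real-valued data. The sketch reuses the channel-estimation setting of Example 10.1 as test data, but the input power, noise level, and the values `mu=0.02`, `eps=0.001`, and `beta=0.9` are assumptions chosen for illustration, as is the function name.

```python
import random

def eps_nlms_power(u, d, m, mu, eps, beta):
    """eps-NLMS with power normalization for a real scalar input sequence u."""
    w = [0.0] * m
    reg = [0.0] * m
    p = 0.0                                            # p(-1) = 0
    for u_i, d_i in zip(u, d):
        reg = [u_i] + reg[:-1]                         # shift structure (11.5)
        p = beta * p + (1.0 - beta) * abs(u_i) ** 2    # power estimate (11.9)
        e = d_i - sum(r * wk for r, wk in zip(reg, w)) # a priori error e(i)
        w = [wk + (mu / (eps + p)) * r * e for wk, r in zip(w, reg)]
    return w, p

rng = random.Random(7)
c = [1.0, 0.5, -1.0, 2.0]                              # channel to be estimated
u = [rng.gauss(0.0, 2.0) for _ in range(4000)]         # input of power 4
d = [sum(c[k] * u[i - k] for k in range(4) if i - k >= 0) + rng.gauss(0.0, 0.1)
     for i in range(len(u))]
w, p = eps_nlms_power(u, d, m=4, mu=0.02, eps=0.001, beta=0.9)
print([round(x, 2) for x in w], round(p, 2))
```

Only the newest sample $u(i)$ enters the update of $p(i)$, so the cost of the normalization does not grow with the filter length $M$.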


LMS with Time-Variant Step-Sizes
More generally, a stochastic-gradient algorithm with a generic time-variant step-size can
be obtained from the steepest-descent method (10.6) if we replace $R_{du} - R_u w_{i-1}$ by

$$u_i^* [d(i) - u_i w_{i-1}]$$

which leads to

$$w_i = w_{i-1} + \mu(i)\, u_i^* [d(i) - u_i w_{i-1}], \qquad i \ge 0 \qquad (11.11)$$

The $\epsilon$-NLMS algorithm, as well as its power-normalized variant, are special cases of
(11.11) with particular choices for $\mu(i)$.

Footnote 5. In some applications such as speech processing, a constant value of $\beta$ may not be
adequate since speech signals generally undergo significant fluctuations.


11.4 LEAST-PERTURBATION PROPERTY 


The $\epsilon$-NLMS algorithm was derived in Sec. 11.1 as an approximate solution to the linear
least-mean-squares estimation problem (10.1), in the sense that it was obtained by replacing
the gradient vector and Hessian matrix in Newton's recursion (11.1) by instantaneous
approximations for them. Later in Sec. 45.2, we shall show that $\epsilon$-NLMS can also be
derived as the exact (not approximate) solution of a well-defined estimation problem, albeit
one with a different cost criterion. Until then, we shall continue to treat $\epsilon$-NLMS as a
stochastic-gradient algorithm and proceed to highlight some of its properties.

One particular property is that $\epsilon$-NLMS, just like LMS, can also be regarded as the exact
solution to a local optimization problem. To see this, we proceed as in Sec. 10.4. Thus,
assume that we have available some weight estimate at time $i-1$, say, $w_{i-1}$, and let
$\{d(i), u_i\}$ denote the newly measured data. Let $w_i$ denote a weight estimate that we wish
to compute from the available data $\{w_{i-1}, d(i), u_i\}$ in the following manner. Let $\mu$ and
$\epsilon$ be two given positive numbers, and consider the problem of determining $w_i$ by solving
the constrained optimization problem:


$$\min_{w_i} \; \|w_i - w_{i-1}\|^2 \quad \text{subject to} \quad r(i) = \left(1 - \frac{\mu\|u_i\|^2}{\epsilon + \|u_i\|^2}\right) e(i) \qquad (11.12)$$

where $\{r(i), e(i)\}$ are defined as in (10.11) and (10.12). In other words, we seek a solution
$w_i$ that is closest to $w_{i-1}$ and ensures an equality constraint between $r(i)$ and $e(i)$. This
constraint is different from the one we employed earlier in (10.13) for LMS, and it becomes
relevant when the step-size $\mu$ is small enough to satisfy

$$\left|\,1 - \frac{\mu\|u_i\|^2}{\epsilon + \|u_i\|^2}\,\right| < 1$$

i.e., when

$$0 < \mu < \frac{2\,(\epsilon + \|u_i\|^2)}{\|u_i\|^2} \quad \text{for all } i \qquad (11.13)$$

This is because when (11.13) holds, the magnitude of the a posteriori error will not exceed
that of the a priori error, i.e., $|r(i)| < |e(i)|$ except when $e(i) = 0$, in which case
$r(i) = 0$ also. In contrast to condition (10.14) in the LMS case, we now find from (11.13) that
the bound on $\mu$ is more relaxed. In particular, any choice of $\mu$ in the interval $(0, 2)$ will
do and any such choice is independent of $u_i$.

The solution of the optimization problem (11.12) can be obtained in precisely the same
way as we did in Sec. 10.4 for LMS. The argument will lead us again to the $\epsilon$-NLMS
recursion:

$$w_i = w_{i-1} + \frac{\mu}{\epsilon + \|u_i\|^2}\, u_i^* [d(i) - u_i w_{i-1}], \qquad i \ge 0, \qquad w_{-1} = \text{initial guess}$$
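The relation between the a posteriori and a priori errors that $\epsilon$-NLMS enforces can be verified on a single update. In the sketch below, $e(i) = d(i) - u_i w_{i-1}$ and $r(i) = d(i) - u_i w_i$; the numerical values of the weights, regressor, and reference sample are arbitrary assumptions, and the function name is introduced for illustration.

```python
def nlms_step(w_prev, u_i, d_i, mu, eps):
    """One eps-NLMS update; returns the new weights and both output errors."""
    e = d_i - sum(x * wk for x, wk in zip(u_i, w_prev))     # a priori error e(i)
    norm_sq = sum(abs(x) ** 2 for x in u_i)
    w = [wk + mu / (eps + norm_sq) * complex(x).conjugate() * e
         for wk, x in zip(w_prev, u_i)]
    r = d_i - sum(x * wk for x, wk in zip(u_i, w))          # a posteriori error r(i)
    return w, e, r, norm_sq

w_prev = [0.2 + 0.1j, -1.0 + 0j, 0.5j]
u_i = [1 - 1j, 2 + 0.5j, -0.3 + 0j]
mu, eps = 0.8, 0.001
w, e, r, norm_sq = nlms_step(w_prev, u_i, d_i=1.5 - 2j, mu=mu, eps=eps)
predicted_r = (1 - mu * norm_sq / (eps + norm_sq)) * e      # constraint of the local problem
print(abs(r - predicted_r), abs(r) < abs(e))
```

With $\mu = 0.8$ inside the interval $(0, 2)$, the a posteriori error is smaller in magnitude than the a priori error, in agreement with the discussion above.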


Other LMS-Type Algorithms 


The idea of using instantaneous approximations to devise stochastic-gradient algorithms 
from steepest-descent implementations is not limited to quadratic cost functions as in 
(10.1). For instance, in Probs. III.16-III.21 we formulate steepest-descent methods for 
a variety of other cost functions. If we then employ instantaneous approximations for the 
associated gradient vectors and Hessian matrices, we would obtain other well-known adap- 
tive algorithms. In this chapter, we list the recursions for various such algorithms of the 
blind and non-blind types. Non-blind methods are so-called because they employ a refer- 
ence sequence {d(i)} in their update recursions. On the other hand, blind algorithms do 
not use a reference sequence. 


12.1 NON-BLIND ALGORITHMS 


We first list several non-blind methods derived in Probs. III.26-III.31. The derivations in
the problems lead to the following statements for the so-called sign-error LMS, leaky-LMS,
least-mean-fourth (LMF), and least-mean-mixed-norm (LMMN) algorithms. For all statements
below, we consider a zero-mean random variable $d$ with realizations $\{d(0), d(1), \ldots\}$,
and a zero-mean random row vector $u$ with realizations $\{u_0, u_1, \ldots\}$. Moreover, $\mu$ is a
positive step-size (usually small).


Algorithm 12.1 (Sign-error LMS algorithm) The weight vector $w^o$ that solves
(in terms of the mean of the $\ell_1$ norm; cf. Probs. III.14-III.15 and III.29):

$$\min_w \; E\,|d - uw|$$

can be approximated iteratively via the recursion

$$w_i = w_{i-1} + \mu\, u_i^*\, \mathrm{csgn}[d(i) - u_i w_{i-1}], \qquad i \ge 0$$

In the above statement, the sign of a complex number, $x = x_r + jx_i$, is defined as
(cf. Prob. III.15):

$$\mathrm{csgn}(x) \triangleq \mathrm{sign}(x_r) + j\,\mathrm{sign}(x_i) \quad \text{where} \quad \mathrm{sign}(a) \triangleq \begin{cases} +1 & \text{if } a > 0 \\ -1 & \text{if } a < 0 \\ 0 & \text{if } a = 0 \end{cases} \qquad (12.1)$$

in terms of the sign of a real number and $j = \sqrt{-1}$.


Algorithm 12.2 (Leaky-LMS algorithm) The weight vector $w^o$ that solves
(cf. Probs. III.12 and III.26):

$$\min_w \; \big[\, \alpha\|w\|^2 + E\,|d - uw|^2 \,\big]$$

for some positive constant $\alpha$, can be approximated iteratively via the recursion

$$w_i = (1 - \mu\alpha)\, w_{i-1} + \mu\, u_i^* [d(i) - u_i w_{i-1}], \qquad i \ge 0$$


Algorithm 12.3 (LMF algorithm) The weight vector $w^o$ that solves (cf. Probs.
III.16 and III.30):

$$\min_w \; E\,|d - uw|^4$$

can be approximated iteratively via the recursion

$$w_i = w_{i-1} + \mu\, u_i^*\, e(i)\, |e(i)|^2, \qquad i \ge 0$$


Algorithm 12.4 (LMMN algorithm) The weight vector $w^o$ that solves (cf. Probs.
III.17 and III.31):

$$\min_w \; E\left[\, \delta\,|e|^2 + \tfrac{1}{2}(1-\delta)\,|e|^4 \,\right], \qquad e = d - uw$$

for some constant $0 \le \delta \le 1$, can be approximated iteratively via the recursion

$$w_i = w_{i-1} + \mu\, u_i^*\, e(i) \big[\, \delta + (1-\delta)\,|e(i)|^2 \,\big], \qquad i \ge 0$$
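The four non-blind recursions differ from LMS only in a leakage factor applied to $w_{i-1}$ and in the function of $e(i)$ that multiplies $\mu u_i^*$. The sketch below makes this explicit with a single update routine. The function names `csgn` and `update`, the string labels, and the default parameter values are assumptions introduced for illustration.

```python
def csgn(x):
    """Complex sign of (12.1): sign of the real part plus j times sign of the imaginary part."""
    def sign(a):
        return (a > 0) - (a < 0)
    x = complex(x)
    return complex(sign(x.real), sign(x.imag))

def update(w_prev, u_i, d_i, mu, kind, alpha=0.0, delta=0.5):
    """One iteration of sign-error LMS, leaky-LMS, LMF, or LMMN."""
    e = d_i - sum(x * wk for x, wk in zip(u_i, w_prev))    # a priori error e(i)
    if kind == "sign-error":
        g, leak = csgn(e), 1.0
    elif kind == "leaky":
        g, leak = e, 1.0 - mu * alpha
    elif kind == "lmf":
        g, leak = e * abs(e) ** 2, 1.0
    elif kind == "lmmn":
        g, leak = e * (delta + (1.0 - delta) * abs(e) ** 2), 1.0
    else:
        raise ValueError("unknown algorithm: " + kind)
    return [leak * wk + mu * complex(x).conjugate() * g for wk, x in zip(w_prev, u_i)]

w0, u0, d0 = [0j, 0j], [1 + 1j, 2 + 0j], 1j
print(update(w0, u0, d0, 0.1, "sign-error"))
```

In this form it is easy to see that LMMN interpolates between LMS ($\delta = 1$) and LMF ($\delta = 0$), and that leaky-LMS with $\alpha = 0$ is LMS itself.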


Listing of Non-Blind Algorithms 

Table 12.1 summarizes the stochastic-gradient algorithms derived so far, along with some
other algorithms that we derive in the sequel. In the table, and in other places in the book,
we use the notation $0 \ll \lambda \le 1$, with the symbol $\ll$, to mean that $\lambda$ is a positive
scalar that is close or equal to one (e.g., $\lambda = 0.995$ or $\lambda = 1$). One way to express
this condition more precisely would be to say that $\lambda$ is a positive scalar inside the
interval $(0, 1]$ and such that $1 - \lambda \ll 1$.

TABLE 12.1 A listing of several stochastic-gradient algorithms for $i \ge 0$; some are derived in
the text while others are motivated and derived in the problems. In all of them, the initial weight
estimate is specified at $i = -1$ and the signal $e(i)$ denotes the a priori output estimation error,
$e(i) = d(i) - u_i w_{i-1}$.

Algorithm | Recursion
LMS with constant step-size | $w_i = w_{i-1} + \mu\, u_i^* e(i)$
LMS with time-variant step-size | $w_i = w_{i-1} + \mu(i)\, u_i^* e(i)$
$\epsilon$-NLMS | $w_i = w_{i-1} + \frac{\mu}{\epsilon + \lVert u_i \rVert^2}\, u_i^* e(i)$
$\epsilon$-NLMS with power normalization | $p(i) = \beta p(i-1) + (1-\beta) \lvert u(i) \rvert^2$, $p(-1) = 0$; $w_i = w_{i-1} + \frac{\mu}{\epsilon + p(i)}\, u_i^* e(i)$
sign-error LMS | $w_i = w_{i-1} + \mu\, u_i^*\, \mathrm{csgn}[e(i)]$
leaky-LMS | $w_i = (1 - \mu\alpha) w_{i-1} + \mu\, u_i^* e(i)$, $\alpha > 0$
LMF | $w_i = w_{i-1} + \mu\, u_i^* e(i) \lvert e(i) \rvert^2$
LMMN | $w_i = w_{i-1} + \mu\, u_i^* e(i) [\delta + (1-\delta) \lvert e(i) \rvert^2]$, $0 \le \delta \le 1$
RLS | $P_i = \lambda^{-1} \left[ P_{i-1} - \frac{\lambda^{-1} P_{i-1} u_i^* u_i P_{i-1}}{1 + \lambda^{-1} u_i P_{i-1} u_i^*} \right]$; $w_i = w_{i-1} + P_i u_i^* e(i)$; $P_{-1} = \epsilon^{-1} I$, $0 \ll \lambda \le 1$
Gauss-Newton (GN) | $P_i = \frac{\lambda^{-1}}{1-\alpha} \left[ P_{i-1} - \frac{\lambda^{-1} P_{i-1} u_i^* u_i P_{i-1}}{\frac{1-\alpha}{\alpha} + \lambda^{-1} u_i P_{i-1} u_i^*} \right]$; $w_i = w_{i-1} + \mu P_i u_i^* e(i)$; $P_{-1} = \epsilon^{-1} I$, $\alpha > 0$ (small), $0 \ll \lambda \le 1$

Tables 12.2 and 12.3 show the respective estimated computational costs per iteration for
both cases of real- and complex-valued data. The costs in the second line of Tables 12.2 and
12.3 correspond to the generic recursion (11.11), and they assume that the values of $\mu(i)$
are available; if these values need to be computed then the costs will of course change. For
completeness, in the multiplications columns in both tables, the entries between parentheses
indicate the number of multiplications that would result if the parameters
$\{\mu, \beta, \alpha, \delta\}$, whenever meaningful, were chosen as powers of $2^{-1}$. Moreover,
$M$ is the filter order.

Observe in particular that all LMS-variants require on the order of $M$ floating point
operations (multiplications and additions) per iteration. We therefore say that they are
$O(M)$ algorithms, with the notation $O(M)$ signifying "on the order of $M$" or "a multiple
of $M$".

It is worth noting that, in practice, besides addition and multiplication operations, the
complexity of an algorithm is also judged in terms of how often it accesses the memory
(i.e., in terms of the frequency of its read and write operations).

TABLE 12.2 A comparison of the estimated computational cost per iteration for several
stochastic-gradient algorithms for real-valued data in terms of the number of real multiplications,
real additions, real divisions, and comparisons with zero (or sign evaluations).

Algorithm | × | + | / | sign
LMS with constant step-size | | | |
LMS with time-variant step-size | | | |
$\epsilon$-NLMS with power normalization | | | |
sign-error LMS | | | |
RLS | $M^2 + 5M + 1$ | $M^2 + 3M$ | 1 |
GN | $M^2 + 7M + 1$ | $M^2 + 3M + 1$ | 3 |

12.2 BLIND ALGORITHMS 


We now list several blind algorithms derived in Probs. III.33-III.36; such methods are useful
in blind channel equalization applications (see Computer Project III.4). In the statements
below we consider a zero-mean random row vector $u$ with realizations $\{u_0, u_1, \ldots\}$
and a positive scalar $\gamma$. Moreover, $\mu$ is a positive step-size (usually small).


Algorithm 12.5 (CMA1-2 and NCMA) A weight vector $w^o$ that solves (cf. Probs.
III.21 and III.36):

$$\min_w \; E\,\big(\gamma - |uw|\big)^2$$

can be approximated iteratively via the recursion

$$w_i = w_{i-1} + \mu\, u_i^* \left[ \gamma\, \frac{z(i)}{|z(i)|} - z(i) \right], \qquad z(i) = u_i w_{i-1}, \qquad i \ge 0$$

or via the normalized form

$$w_i = w_{i-1} + \frac{\mu}{\|u_i\|^2}\, u_i^* \left[ \gamma\, \frac{z(i)}{|z(i)|} - z(i) \right], \qquad z(i) = u_i w_{i-1}, \qquad i \ge 0$$

In both cases, we set $w_i = w_{i-1}$ when $z(i) = 0$.


TABLE 12.3 A comparison of the estimated computational cost per iteration for several
stochastic-gradient algorithms for the case of complex-valued data in terms of the number of real
multiplications, real additions, real divisions, and comparisons with zero (or sign evaluations).

Algorithm | × | + | / | sign
LMS with constant step-size | | | |
LMS with time-variant step-size | | | |
$\epsilon$-NLMS | | | |
$\epsilon$-NLMS with power normalization | $8M + 6$ | | |
sign-error LMS | | | |
leaky-LMS | $10M + 3$ $(8M + 2)$ | | |
RLS | $4M^2 + 16M + 1$ | $4M^2 + 12M - 1$ | |
GN | $4M^2 + 20M + 1$ | $4M^2 + 12M$ | |


Algorithm 12.6 (CMA2-2) A weight vector $w^o$ that solves (cf. Probs. III.18
and III.33):

$$\min_w \; E\,\big(\gamma - |uw|^2\big)^2$$

can be approximated iteratively via the recursion

$$w_i = w_{i-1} + \mu\, u_i^*\, z(i) \big[\gamma - |z(i)|^2\big], \qquad z(i) = u_i w_{i-1}, \qquad i \ge 0$$


Algorithm 12.7 (RCA) A weight vector $w^o$ that solves (cf. Probs. III.19 and
III.34):

$$\min_w \; E\,\big|uw - \gamma\, \mathrm{csgn}(uw)\big|^2$$

can be approximated iteratively via the recursion

$$w_i = w_{i-1} + \mu\, u_i^* \big[\gamma\, \mathrm{csgn}(z(i)) - z(i)\big], \qquad z(i) = u_i w_{i-1}, \qquad i \ge 0$$


Algorithm 12.8 (MMA) A weight vector $w^o$ that solves (cf. Probs. III.20 and
III.35):

$$\min_w \; E\left\{ \big[(\mathrm{Re}(uw))^2 - \gamma\big]^2 + \big[(\mathrm{Im}(uw))^2 - \gamma\big]^2 \right\}$$

where $\mathrm{Re}(\cdot)$ and $\mathrm{Im}(\cdot)$ denote the real and imaginary parts of their
arguments, can be approximated iteratively via the recursion

$$z(i) = u_i w_{i-1}$$
$$a(i) = \mathrm{Re}[z(i)]$$
$$b(i) = \mathrm{Im}[z(i)]$$
$$e(i) = a(i)\big[\gamma - a^2(i)\big] + j\, b(i)\big[\gamma - b^2(i)\big]$$
$$w_i = w_{i-1} + \mu\, u_i^*\, e(i)$$


Listing of Blind Algorithms 
Observe that for real-valued data, the RCA recursion reduces to that of CMA1-2. The blind 
recursions are summarized in Table 12.4. 
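The blind recursions of Algorithms 12.5 through 12.8 share one structure: a correction term computed from $z(i) = u_i w_{i-1}$ alone is multiplied by $\mu u_i^*$. The sketch below collects them in a single routine. The function name `blind_update` and the string labels are assumptions introduced for illustration; the early return for $z(i) = 0$ follows the convention stated for CMA1-2 and NCMA and is harmless for the other recursions, whose correction terms vanish at $z(i) = 0$ anyway.

```python
def csgn(z):
    """Complex sign: sign of the real part plus j times sign of the imaginary part."""
    sign = lambda a: (a > 0) - (a < 0)
    return complex(sign(z.real), sign(z.imag))

def blind_update(w_prev, u_i, mu, gamma, kind):
    """One iteration of CMA2-2, CMA1-2, NCMA, RCA, or MMA."""
    z = complex(sum(x * wk for x, wk in zip(u_i, w_prev)))   # z(i) = u_i w_{i-1}
    if z == 0:
        return list(w_prev)                                  # no update when z(i) = 0
    step = mu
    if kind == "cma2-2":
        g = z * (gamma - abs(z) ** 2)
    elif kind in ("cma1-2", "ncma"):
        g = gamma * z / abs(z) - z
        if kind == "ncma":
            step = mu / sum(abs(x) ** 2 for x in u_i)        # normalized form
    elif kind == "rca":
        g = gamma * csgn(z) - z
    elif kind == "mma":
        a, b = z.real, z.imag
        g = complex(a * (gamma - a * a), b * (gamma - b * b))
    else:
        raise ValueError("unknown algorithm: " + kind)
    return [wk + step * complex(x).conjugate() * g for wk, x in zip(w_prev, u_i)]

print(blind_update([1 + 0j], [2 + 0j], 0.01, 1.0, "cma2-2"))
```

The early return also shows why a zero initial weight vector is unsuitable for these methods: with $w_{-1} = 0$ every $z(i)$ is zero and the weights never move.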


TABLE 12.4 A listing of several blind stochastic-gradient algorithms derived in the problems.
In all of them, the initial weight estimate is specified at iteration $i = -1$ and, clearly, it
cannot be chosen as zero since otherwise the recursions do not update the weight estimate.

Algorithm | Recursion
CMA2-2 | $w_i = w_{i-1} + \mu\, u_i^* z(i) [\gamma - \lvert z(i) \rvert^2]$; $z(i) = u_i w_{i-1}$
CMA1-2 | $w_i = w_{i-1} + \mu\, u_i^* \left[ \gamma \frac{z(i)}{\lvert z(i) \rvert} - z(i) \right]$; $z(i) = u_i w_{i-1}$
RCA | $w_i = w_{i-1} + \mu\, u_i^* [\gamma\, \mathrm{csgn}(z(i)) - z(i)]$; $z(i) = u_i w_{i-1}$
MMA | $w_i = w_{i-1} + \mu\, u_i^* e(i)$; $z(i) = u_i w_{i-1}$, $a(i) = \mathrm{Re}(z(i))$, $b(i) = \mathrm{Im}(z(i))$; $e(i) = a(i)[\gamma - a^2(i)] + j\, b(i)[\gamma - b^2(i)]$


Tables 12.5-12.6 summarize the estimated costs for both real- and complex-valued data and for constant step-sizes. For completeness, in the multiplications columns in both tables, the entries between parentheses indicate the number of multiplications that would result if the step-size $\mu$ is chosen as a power of $2^{-1}$.


12.3 SOME PROPERTIES 


The algorithms described in this chapter serve purposes different from LMS and ε-NLMS.

1. Different cost functions correspond to different optimality criteria, and the resulting stochastic-gradient algorithms can behave differently than LMS and ε-NLMS under varied statistical conditions on the data. We shall see in Parts IV (Mean-Square Performance) and V (Transient Performance) that this is indeed the case; see, e.g., Sec. 21.5.


TABLE 12.5 Estimated computational cost per iteration for real-valued data in terms of the number of real multiplications, real additions, real divisions, and comparisons with zero (or sign evaluations).

Algorithm | × | + | / | sign
CMA2-2 | 2M + 3 (2M + 2) | 2M | |
CMA1-2 | 2M + 2 (2M + 1) | 2M | … | …
NCMA | 3M + 2 (3M + 1) | 3M − 1 | … | …
RCA | 2M + 1 (2M) | 2M | … | …
MMA | 2M + 3 (2M + 1) | 2M | |


TABLE 12.6 Estimated computational cost per iteration for the case of complex-valued data in terms of the number of real multiplications, real additions, real divisions, and comparisons with zero (or sign evaluations).

Algorithm | × | + | / | sign
CMA2-2 | 8M + 8 (8M + 6) | 8M + 2 | |
CMA1-2 | 8M + 4 (8M + 2) | 8M | … | …
NCMA | 10M + 4 (10M + 2) | 10M − 1 | … | …
RCA | 8M + 2 (8M) | 8M | … | …
MMA | 8M + 6 (8M + 4) | 8M | |


2. Sign-error LMS. The motivation for introducing sign-error LMS, especially for real-valued data, is its computational simplicity. While it may seem from Table 12.2 that the cost per iteration of sign-error LMS is similar to that of LMS, the point is that the step-size $\mu$ is usually selected as a power of $2^{-1}$, say, $\mu = 2^{-m}$ for some integer $m > 0$. When this is the case, the evaluation of $\mu u_i^*\,\mathrm{sign}[e(i)]$ in the real case can be implemented digitally very efficiently by means of shift registers, and we can ignore the $M$ multiplications that are needed for a generic $\mu$. In this case, we can replace the $2M$ figure that appears in Table 12.2 by $M$ multiplications. This simplification is not possible for LMS because it uses $e(i)$ instead of its sign. Nevertheless, the simplification in computations for sign-error LMS comes at the expense of slower convergence.

3. Leaky-LMS. The LMS algorithm can suffer from a potential instability problem when the covariance matrix $R_u$ is singular or close to singular (an example is given in Prob. IV.40 in the context of fractionally-spaced equalizers). When this happens, the weight estimates $w_i$ can drift and grow unbounded; see Prob. III.27 for an example and also Prob. IV.39 for a more detailed explanation. The leaky-LMS algorithm limits the growth of the weight estimates by employing a coefficient, $(1 - \mu\alpha)$, in the recursion for the weight vector. We shall study the properties of this algorithm in Part V (Transient Performance); see Probs. V.27-V.32. In particular, we shall see that while leaky-LMS solves the drift problem, it nevertheless introduces a bias problem in that the mean value of $w_i$ will not tend to the optimal solution $w^o = R_u^{-1}R_{du}$ of the normal equations.
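The bias introduced by the leakage factor can be made concrete with a small numerical sketch. The code below assumes the leaky-LMS update has the form $w_i = (1 - \mu\alpha)w_{i-1} + \mu u_i^*[d(i) - u_i w_{i-1}]$ and that, under the usual independence assumptions, the mean weight vector obeys the same recursion with $u_i^* d(i)$ and $u_i^* u_i$ replaced by $R_{du}$ and $R_u$. Both are assumptions made for illustration; the function names and the numerical values are hypothetical.

```python
import numpy as np

def leaky_lms_step(w, u, d, mu, alpha):
    """One leaky-LMS step (assumed form): (1 - mu*alpha) w + mu u^* e."""
    e = d - u @ w                         # a priori error e(i) = d(i) - u_i w_{i-1}
    return (1.0 - mu * alpha) * w + mu * np.conj(u) * e

def mean_recursion_limit(R_u, R_du, mu, alpha, n_iter=5000):
    """Iterate the assumed mean recursion E w_i until it settles."""
    w_mean = np.zeros(R_u.shape[0])
    for _ in range(n_iter):
        w_mean = (1.0 - mu * alpha) * w_mean + mu * (R_du - R_u @ w_mean)
    return w_mean

R_u = np.array([[2.0, 0.5], [0.5, 1.0]])  # illustrative covariance matrix
w_o = np.array([1.0, -1.0])               # optimal solution of the normal equations
R_du = R_u @ w_o
alpha = 0.2

w_leaky = mean_recursion_limit(R_u, R_du, mu=0.1, alpha=alpha)
w_biased = np.linalg.solve(R_u + alpha * np.eye(2), R_du)
print(w_leaky, w_biased, w_o)
```

In this sketch the mean recursion settles at $(R_u + \alpha I)^{-1}R_{du}$ rather than at $w^o$, which is one way to see the bias; setting `alpha = 0` in `leaky_lms_step` recovers the plain LMS step.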


CHAPTER 13


Affine Projection Algorithm


The LMS and ε-NLMS algorithms were obtained by using simple instantaneous approximations for the covariance and cross-covariance quantities $\{R_{du}, R_u\}$. More involved algorithms, with better performance but at increased computational costs, can be obtained by resorting to more sophisticated approximations for $\{R_{du}, R_u\}$. We illustrate this situation by motivating the so-called affine projection algorithm.


13.1 INSTANTANEOUS APPROXIMATION 


Just like ε-NLMS, we again start from the regularized Newton's recursion (10.8), namely,

$$w_i = w_{i-1} + \mu\left[\epsilon' I + R_u\right]^{-1}\left[R_{du} - R_u w_{i-1}\right] \qquad (13.1)$$

albeit with a fixed step-size $\mu$ and a fixed regularization parameter $\epsilon'$. Now, however, we shall employ a better approximation for both the covariance matrix, $R_u$, and the cross-covariance vector, $R_{du}$. Specifically, we choose a positive integer $K$ (usually $K \leq M$, where $M \times 1$ is the size of the weight vector) and replace $\{R_{du}, R_u\}$ by the following instantaneous approximations:

$$\hat{R}_u = \frac{1}{K}\sum_{j=i-K+1}^{i} u_j^* u_j, \qquad \hat{R}_{du} = \frac{1}{K}\sum_{j=i-K+1}^{i} d(j)\,u_j^*$$

In other words, at each iteration $i$, we use the $K$ most recent regressors and the $K$ most recent observations,

$$\{u_i, u_{i-1}, \ldots, u_{i-K+1}\} \quad \text{and} \quad \{d(i), d(i-1), \ldots, d(i-K+1)\}$$

to compute the approximate values for $\{R_u, R_{du}\}$.


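Let $\epsilon = K\epsilon'$, so that the factor $1/K$ in the approximations cancels out of the recursion. If we introduce the $K \times M$ block data matrix

$$U_i = \begin{bmatrix} u_i \\ u_{i-1} \\ \vdots \\ u_{i-K+1} \end{bmatrix} \quad (K \times M) \qquad (13.2)$$

and the $K \times 1$ data vector

$$d_i = \begin{bmatrix} d(i) \\ d(i-1) \\ \vdots \\ d(i-K+1) \end{bmatrix} \quad (K \times 1) \qquad (13.3)$$

then we can express $\{\hat{R}_u, \hat{R}_{du}\}$ more compactly as

$$\hat{R}_u = \frac{1}{K}U_i^* U_i, \qquad \hat{R}_{du} = \frac{1}{K}U_i^* d_i$$

so that Newton's recursion (13.1) becomes

$$w_i = w_{i-1} + \mu\left(\epsilon I + U_i^* U_i\right)^{-1} U_i^*\left[d_i - U_i w_{i-1}\right] \qquad (13.4)$$

Although $U_i^* U_i$ is singular when $K < M$, the term $\epsilon I$ guarantees the invertibility of $\epsilon I + U_i^* U_i$.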
Recursion (13.4) requires the inversion of the $M \times M$ matrix $(\epsilon I + U_i^* U_i)$ at each iteration. Alternatively, we can invoke the matrix inversion formula (5.4) to verify that

$$\left(\epsilon I + U_i^* U_i\right)^{-1} U_i^* = U_i^*\left(\epsilon I + U_i U_i^*\right)^{-1}$$

in which case (13.4) becomes

$$w_i = w_{i-1} + \mu U_i^*\left(\epsilon I + U_i U_i^*\right)^{-1}\left[d_i - U_i w_{i-1}\right] \qquad (13.5)$$

This form requires inverting the (usually smaller) $K \times K$ matrix $(\epsilon I + U_i U_i^*)$ at each iteration; recursion (13.5) is also useful even when $\epsilon = 0$ since $U_i U_i^*$ is generally invertible when $K \leq M$. Algorithm (13.5) is what is known as the affine projection algorithm or, more accurately, ε-APA, in order to highlight the presence of the regularization factor $\epsilon$. In particular, it is seen that when $K = 1$, ε-APA reduces to the ε-NLMS recursion of Alg. 11.1. More generally, when compared with ε-NLMS, and even with the LMS recursion of Alg. 10.1, we find that ε-APA uses a vector-valued estimation error, $e_i = d_i - U_i w_{i-1}$, as opposed to the scalar-valued error, $e(i) = d(i) - u_i w_{i-1}$, used by LMS and ε-NLMS. This is because the latter algorithms use only the most recent regressor to update the weight vector estimate, whereas ε-APA uses the $K$ most recent regressors for this same task. For this reason, affine projection algorithms are sometimes called data-reusing algorithms since they re-use past regressor and reference data. The integer $K$ is referred to as the order of the ε-APA filter.


Algorithm 13.1 (ε-APA algorithm) Consider a zero-mean random variable $d$ with realizations $\{d(0), d(1), \ldots\}$, and a zero-mean random row vector $u$ with realizations $\{u_0, u_1, \ldots\}$. The optimal weight vector $w^o$ that solves

$$\min_{w} \; E\,|d - uw|^2$$

can be approximated iteratively via the recursion

$$w_i = w_{i-1} + \mu U_i^*\left(\epsilon I + U_i U_i^*\right)^{-1}\left[d_i - U_i w_{i-1}\right]$$

where $\{U_i, d_i\}$ are defined by (13.2)-(13.3) and $K$ is a positive integer that denotes the filter order (usually $K \leq M$).
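A minimal NumPy sketch of one ε-APA step is given below, assuming that the rows of `U` hold the regressors $u_i, u_{i-1}, \ldots, u_{i-K+1}$ and that `d` holds the matching observations. The function names are hypothetical, and a linear solve is used in place of an explicit $K \times K$ inverse as an implementation choice. An ε-NLMS step is included only to illustrate the $K = 1$ special case.

```python
import numpy as np

def apa_step(w, U, d, mu, eps):
    """eps-APA: w_i = w_{i-1} + mu U^* (eps I + U U^*)^{-1} (d_i - U_i w_{i-1})."""
    K = U.shape[0]
    e = d - U @ w                                   # vector-valued a priori error e_i
    G = eps * np.eye(K) + U @ U.conj().T            # K x K regularized Gram matrix
    return w + mu * U.conj().T @ np.linalg.solve(G, e)

def nlms_step(w, u, d, mu, eps):
    """eps-NLMS: w_i = w_{i-1} + mu u^* e(i) / (eps + ||u||^2)."""
    e = d - u @ w
    return w + mu * np.conj(u) * e / (eps + np.vdot(u, u).real)

rng = np.random.default_rng(1)
M, K = 5, 3
U = rng.standard_normal((K, M)) + 1j * rng.standard_normal((K, M))
d = rng.standard_normal(K) + 1j * rng.standard_normal(K)
w = rng.standard_normal(M) + 1j * rng.standard_normal(M)

w_apa = apa_step(w, U, d, mu=0.5, eps=1e-2)
# With a single row (K = 1) the APA step coincides with the NLMS step.
w_k1 = apa_step(w, U[:1], d[:1], mu=0.5, eps=1e-2)
w_nlms = nlms_step(w, U[0], d[0], mu=0.5, eps=1e-2)
```

Solving the $K \times K$ system rather than the $M \times M$ one in (13.4) is what keeps the per-iteration cost tied to $K$ when $K$ is small relative to $M$.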


13.2 COMPUTATIONAL COST 


The computational cost of ε-APA is a function of its order $K$. Tables 13.1 and 13.2 show the estimated number of real multiplications and real additions that are required in the evaluation of specific terms for both cases of real and complex-valued data. The tables assume that the cost of inverting a $K \times K$ matrix is $O(K^3)$ operations (multiplications and additions). The main conclusion is that the cost of ε-APA is $O(K^2 M)$ operations per iteration.


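TABLE 13.1 Estimated computational cost of ε-APA per iteration for real-valued data in terms of the number of real multiplications and real additions.

Term | × | +
$\epsilon I + U_i U_i^*$ | … | …
$(\epsilon I + U_i U_i^*)^{-1}$ | $K^3$ | …
$(\epsilon I + U_i U_i^*)^{-1}[d_i - U_i w_{i-1}]$ | … | …
$U_i^*(\epsilon I + U_i U_i^*)^{-1}[d_i - U_i w_{i-1}]$ | … | …
TOTAL per iteration | $(K^2 + 2K)M + K^3 + K$ | $(K^2 + 2K)M + K^3 + K^2$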


TABLE 13.2 Estimated computational cost of ε-APA algorithm per iteration for complex-valued data in terms of the number of real multiplications and real additions.

Term | × | +
$U_i w_{i-1}$ | $4KM$ | $2K(2M - 1)$
$d_i - U_i w_{i-1}$ | | $2K$
$U_i U_i^*$ | $4K^2 M$ | $2K^2(2M - 1)$
$\epsilon I + U_i U_i^*$ | | $2K$
$(\epsilon I + U_i U_i^*)^{-1}$ | $4K^3$ | $4K^3$
$(\epsilon I + U_i U_i^*)^{-1}[d_i - U_i w_{i-1}]$ | $4K^2$ | $2K(2K - 1)$
$U_i^*(\epsilon I + U_i U_i^*)^{-1}[d_i - U_i w_{i-1}]$ | $4KM$ | $2(2K - 1)M$
$w_i$ | | $2M$
TOTAL per iteration | $4(K^2 + 2K)M + 4K^3 + 4K$ | $4(K^2 + 2K)M + 4K^3 + 2K^2$


13.3 LEAST-PERTURBATION PROPERTY 


In a manner similar to what we did for ε-NLMS in Sec. 10.4, we can motivate the affine projection algorithm as the exact solution to a local optimization problem. To see this, assume that we have available a weight estimate at time $i - 1$, say, $w_{i-1}$, and let $\{d_i, U_i\}$ denote the data at iteration $i$ (constructed as in (13.2)-(13.3)). Let also $\mu$ denote a given positive number and define two estimation error vectors: the a priori output estimation error

$$e_i = d_i - U_i w_{i-1} \qquad (13.6)$$

and the a posteriori output estimation error

$$r_i = d_i - U_i w_i \qquad (13.7)$$

We can then seek a $w_i$ that solves the constrained optimization criterion:

$$\min_{w_i} \; \|w_i - w_{i-1}\|^2 \quad \text{subject to} \quad r_i = \left(I - \mu U_i U_i^*\left(\epsilon I + U_i U_i^*\right)^{-1}\right)e_i \qquad (13.8)$$

In other words, we seek a $w_i$ that is closest to $w_{i-1}$ in the Euclidean norm sense and subject to an equality constraint between $r_i$ and $e_i$. As was the case with ε-NLMS in Sec. 11.4, we show in Prob. III.41 that because of the constraint (13.8), any step-size $\mu$ in the interval $(0, 2)$ guarantees the desirable property $\|r_i\|^2 \leq \|e_i\|^2$ (with equality when $e_i = 0$), i.e., it guarantees that $U_i w_i$ will be a better estimate for $d_i$ than $U_i w_{i-1}$. Moreover, the solution of (13.8) can be obtained in the same manner as in Sec. 10.4 for LMS. The argument will lead us again to the ε-APA recursion (13.5); see Prob. III.42.


13.4 AFFINE PROJECTION INTERPRETATION 


The formulation (13.8) allows us to explain the reason for the denomination "affine projection": a special case of the affine projection recursion (13.5) admits an interpretation in terms of projections onto affine subspaces. To see this, refer to (13.8) with $K \leq M$ and consider the special choices $\mu = 1$ and $\epsilon = 0$, in which case (13.8) reduces to

$$\min_{w_i} \; \|w_i - w_{i-1}\|^2 \quad \text{subject to} \quad r_i = 0$$

or, equivalently,

$$\min_{w_i} \; \|w_i - w_{i-1}\|^2 \quad \text{subject to} \quad d_i = U_i w_i \qquad (13.9)$$

and the ε-APA update (13.5) becomes

$$w_i = w_{i-1} + U_i^*\left(U_i U_i^*\right)^{-1}\left[d_i - U_i w_{i-1}\right] \qquad (13.10)$$

In other words, when $\mu = 1$ and $\epsilon = 0$, APA enforces the equality $d_i = U_i w_i$. In the special case $K = 1$, we recover the ε-NLMS scenario (with $\epsilon = 0$):

$$\min_{w_i} \; \|w_i - w_{i-1}\|^2 \quad \text{subject to} \quad d(i) = u_i w_i \qquad (13.11)$$

whose solution is

$$w_i = w_{i-1} + \frac{u_i^*}{\|u_i\|^2}\left[d(i) - u_i w_{i-1}\right] \qquad (13.12)$$

In this case, NLMS is enforcing the equality $d(i) = u_i w_i$. This observation admits the following geometric interpretation.

For any given data $\{d(i), u_i\}$, there are infinitely many vectors $w$ that solve $d(i) = u_i w$. The set of all such $w$ is an affine subspace (also called a hyperplane or a manifold), denoted by $\mathcal{M}_i$, whose defining equation is

$$\mathcal{M}_i \triangleq \{\text{set of all vectors } w \text{ such that } u_i w - d(i) = 0\}$$

FIGURE 13.1 Manifolds associated with a second-order APA implementation with $\mu = 1$, $\mathcal{M}_{i-1}$ and $\mathcal{M}_i$. The estimate $w_{i-1}$, which lies in $\mathcal{M}_{i-1}$, is updated to a point $w_i$ lying in the intersection of both manifolds.


The qualification "affine" is used to indicate that the hyperplane does not necessarily pass through the origin $w = 0$. Given $w_{i-1}$, and because of (13.11), NLMS selects that particular vector $w_i$ from this subspace that is closest to $w_{i-1}$ in the Euclidean norm sense. We therefore say that $w_i$ is obtained as the projection of $w_{i-1}$ onto the manifold $\mathcal{M}_i$.

When $K > 1$, on the other hand, we find from (13.9) that the APA recursion (13.10) is such that it enforces $K$ equalities (as opposed to a single one):

$$d(i) = u_i w_i, \quad d(i-1) = u_{i-1} w_i, \quad \ldots, \quad d(i-K+1) = u_{i-K+1} w_i$$

For each pair of data $\{d(i-j), u_{i-j}\}$, there are infinitely many vectors $w$ that satisfy $d(i-j) = u_{i-j} w$ and which define a manifold $\mathcal{M}_{i-j}$. The weight vector $w_i$ that is computed by (13.9)-(13.10) lies in the intersection of the corresponding $K$ manifolds,

$$w_i \in \bigcap_{j=i-K+1}^{i} \mathcal{M}_j$$

We thus say that $w_i$ of APA with $\mu = 1$ is obtained by projecting onto the intersection of the manifolds $\{\mathcal{M}_j,\; j = i, i-1, \ldots, i-K+1\}$. Figure 13.1 illustrates this construction for the case $K = 2$. Two manifolds are shown in the figure, $\mathcal{M}_i$ and $\mathcal{M}_{i-1}$. In the figure, the estimate $w_{i-1}$ lies in $\mathcal{M}_{i-1}$ while the updated estimate $w_i$ lies in the intersection of both manifolds, $\mathcal{M}_{i-1} \cap \mathcal{M}_i$.
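The projection interpretation can be checked numerically. The sketch below, which uses hypothetical names and random real-valued data chosen only for illustration, applies the update (13.10) and compares the result with another point of the same intersection of manifolds, obtained by adding a vector from the null space of $U_i$.

```python
import numpy as np

def apa_projection(w_prev, U, d):
    """APA update with mu = 1 and eps = 0, cf. (13.10)."""
    return w_prev + U.conj().T @ np.linalg.solve(U @ U.conj().T, d - U @ w_prev)

rng = np.random.default_rng(0)
K, M = 2, 5
U = rng.standard_normal((K, M))      # rows play the role of u_i and u_{i-1}
d = rng.standard_normal(K)
w_prev = rng.standard_normal(M)

w_new = apa_projection(w_prev, U, d)

# Another point in the same intersection: add a component from the null space of U.
v = rng.standard_normal(M)
n = v - U.T @ np.linalg.solve(U @ U.T, U @ v)
w_alt = w_new + n

print(np.linalg.norm(w_new - w_prev), np.linalg.norm(w_alt - w_prev))
```

Both `w_new` and `w_alt` satisfy the $K$ equalities $d_i = U_i w$, but `w_new` is the closer of the two to `w_prev`, as expected of a projection.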


Variations 
There are variations of the affine projection algorithm. One such variant is the so-called partial-rank algorithm (PRA), which replaces the update (13.5) by one of the form:

$$w_i = w_{i-K} + \mu U_i^*\left(\epsilon I + U_i U_i^*\right)^{-1}\left[d_i - U_i w_{i-K}\right] \qquad (13.13)$$

where the quantities $\{U_i, d_i\}$ are still defined as in (13.2)-(13.3). The main difference from ε-APA is that the weight estimate is kept fixed at $w_{i-K}$ during the time instants $j = i-K+1, i-K+2, \ldots, i-1$ and, hence, the weight vector is updated only once every $K$ iterations. The computational cost of PRA will therefore be less than that of ε-APA by a factor of $K$, namely, $O(KM)$ operations per iteration.

More generally, a family of affine projection algorithms can be defined as follows:

$$w_i = w_{i-1-\alpha(K-1)} + \mu U_i^*\left(\epsilon I + U_i U_i^*\right)^{-1}\left[d_i - U_i w_{i-1-\alpha(K-1)}\right] \qquad (13.14)$$

where, for example, $\alpha = 0$ corresponds to the ε-APA form (13.5) while $\alpha = 1$ corresponds to PRA. Moreover, the data $\{U_i, d_i\}$ can also be taken as

$$U_i \triangleq \begin{bmatrix} u_i \\ u_{i-D} \\ \vdots \\ u_{i-(K-1)D} \end{bmatrix}, \qquad d_i \triangleq \begin{bmatrix} d(i) \\ d(i-D) \\ \vdots \\ d(i-(K-1)D) \end{bmatrix} \qquad (13.15)$$

with their entries separated by multiples of some delay $D$. The choice $D = 1$ corresponds to ε-APA and PRA, but larger values for $D$ can also be employed. One motivation for using $D > 1$ is to increase the separation, and consequently reduce the correlation, among the regressors in $U_i$. As we are going to see in our studies on the performance of adaptive filters, starting in Chapter 15, the performance of LMS-type filters tends to be sensitive to the correlation of its regression data. The ε-APA filters, with $D = 1$ or $D > 1$, address some of these deficiencies, albeit at an increased computational cost over NLMS.

Finally, it is worth mentioning that the APA form (13.5), with $\epsilon = 0$, can be rewritten in an equivalent form in terms of orthogonalized regressors $\{\bar{u}_i\}$ that are obtained from the original regressors $\{u_i\}$ via a Gram-Schmidt orthogonalization procedure (see Prob. III.43). The details are carried out in Prob. III.44 and they lead to the following statement.


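Algorithm 13.2 (APA with orthogonal update factors) Consider the APA update

$$w_i = w_{i-1} + \mu U_i^*\left(U_i U_i^*\right)^{-1}\left[d_i - U_i w_{i-1}\right]$$

with a full rank regression matrix $U_i$ and $K \leq M$. This update can be equivalently implemented as follows. For each iteration $i$, perform the following steps. Start with $w^{(0)} = w_{i-1}$ and repeat for $k = 0, 1, \ldots, K-1$:

$$\bar{u}_{i-k} = u_{i-k}\left[I - \sum_{j=1}^{k} \frac{\bar{u}_{i-j+1}^*\,\bar{u}_{i-j+1}}{\|\bar{u}_{i-j+1}\|^2}\right]$$

$$e^{(k)} = d(i-k) - u_{i-k}\,w^{(k)}$$

$$w^{(k+1)} = w^{(k)} + \frac{\bar{u}_{i-k}^*}{\|\bar{u}_{i-k}\|^2}\,e^{(k)}$$

Then set

$$w_i = (1 - \mu)\,w_{i-1} + \mu\,w^{(K)}$$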
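The equivalence between the direct APA update and its implementation through orthogonalized regressors can be sketched numerically as follows. The code assumes a Gram-Schmidt step that removes from each regressor its components along the previously orthogonalized ones, followed by one scalar correction per row and a final combination with weight $\mu$; function names and test data are hypothetical.

```python
import numpy as np

def apa_direct(w_prev, U, d, mu):
    """Direct APA update with eps = 0."""
    return w_prev + mu * U.conj().T @ np.linalg.solve(U @ U.conj().T, d - U @ w_prev)

def apa_orthogonal(w_prev, U, d, mu):
    """Same update via sequential corrections along orthogonalized regressors."""
    w = w_prev.astype(complex)
    basis = []                                   # orthogonalized rows found so far
    for k in range(U.shape[0]):
        u = U[k]
        u_bar = u.astype(complex)
        for q in basis:                          # Gram-Schmidt on row vectors
            u_bar = u_bar - (u @ q.conj()) / np.vdot(q, q) * q
        e = d[k] - u @ w                         # error with the original regressor
        w = w + np.conj(u_bar) * e / np.vdot(u_bar, u_bar).real
        basis.append(u_bar)
    return (1.0 - mu) * w_prev + mu * w

rng = np.random.default_rng(2)
K, M = 3, 6
U = rng.standard_normal((K, M)) + 1j * rng.standard_normal((K, M))
d = rng.standard_normal(K) + 1j * rng.standard_normal(K)
w_prev = rng.standard_normal(M) + 1j * rng.standard_normal(M)

w_a = apa_direct(w_prev, U, d, mu=0.7)
w_b = apa_orthogonal(w_prev, U, d, mu=0.7)
```

The orthogonal form avoids the explicit $K \times K$ solve at the price of the Gram-Schmidt steps; which form is preferable in practice depends on $K$ and on numerical considerations not examined here.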


Listing of the Algorithms 


Table 13.3 lists the affine projection algorithms described in this section, while Tables 13.4-13.5 compare their computational cost per iteration for real and complex data.


TABLE 13.3 A listing of three affine projection algorithms for $i \geq 0$. In all of them, the initial weight estimate is specified at iteration $i = -1$. The data are defined by $U_i = \mathrm{col}\{u_i, \ldots, u_{i-K+1}\}$ and $d_i = \mathrm{col}\{d(i), \ldots, d(i-K+1)\}$.

Algorithm | Recursion
APA | $w_i = w_{i-1} + \mu U_i^*\left(U_i U_i^*\right)^{-1}\left[d_i - U_i w_{i-1}\right]$
ε-APA | $w_i = w_{i-1} + \mu U_i^*\left(\epsilon I + U_i U_i^*\right)^{-1}\left[d_i - U_i w_{i-1}\right]$
PRA | $w_i = w_{i-K} + \mu U_i^*\left(\epsilon I + U_i U_i^*\right)^{-1}\left[d_i - U_i w_{i-K}\right]$

TABLE 13.4 A comparison of the estimated computational cost per iteration for several affine projection algorithms for real-valued data in terms of the number of real multiplications and real additions.

Algorithm | × | +
APA | $(K^2 + 2K)M + K^3 + K$ | $(K^2 + 2K)M + K^3 + K^2 - K$
ε-APA | $(K^2 + 2K)M + K^3 + K$ | $(K^2 + 2K)M + K^3 + K^2$
PRA | $(K + 2)M + K^2 + 1$ | $(K + 2)M + K^2 + K$


TABLE 13.5 A comparison of the estimated computational cost per iteration for several affine projection algorithms for complex-valued data in terms of the number of real multiplications and real additions.

Algorithm | × | +
APA | $4(K^2 + 2K)M + 4K^3 + 4K$ | $4(K^2 + 2K)M + 4K^3 + 2K^2 - K$
ε-APA | $4(K^2 + 2K)M + 4K^3 + 4K$ | $4(K^2 + 2K)M + 4K^3 + 2K^2$
PRA | $4(K + 2)M + 4K^2 + 4$ | $4(K + 2)M + 4K^2 + 2K$


CHAPTER 14


RLS Algorithm


A second example of an algorithm that employs a more sophisticated approximation for $R_u$ is the Recursive-Least-Squares (RLS) algorithm. Although this algorithm can be motivated and derived as the exact solution to a well-defined estimation problem with a least-squares cost function, as we shall show in detail later in the book (see, e.g., Sec. 30.6), we shall motivate it here as simply a stochastic-gradient method. In this way, readers can get an early introduction to this important adaptive algorithm.


14.1 INSTANTANEOUS APPROXIMATION 


Just like ε-NLMS and ε-APA, we again start from the regularized Newton's recursion (10.8), namely,

$$w_i = w_{i-1} + \mu(i)\left[\epsilon(i) I + R_u\right]^{-1}\left[R_{du} - R_u w_{i-1}\right] \qquad (14.1)$$

and replace

$$\left(R_{du} - R_u w_{i-1}\right)$$

by the instantaneous approximation

$$u_i^*\left[d(i) - u_i w_{i-1}\right]$$

Now, however, we replace $R_u$ by a better estimate for it, which we choose as the exponentially weighted sample average

$$\hat{R}_u = \frac{1}{i+1}\sum_{j=0}^{i} \lambda^{i-j} u_j^* u_j$$

for some scalar $0 \ll \lambda \leq 1$. Assume first that $\lambda = 1$. Then the above expression for $\hat{R}_u$ amounts to averaging all past regressors up to time $i$, namely,

$$\hat{R}_u = \frac{1}{i+1}\sum_{j=0}^{i} u_j^* u_j \quad \text{when } \lambda = 1$$

Choosing a value for $\lambda$ that is less than one introduces memory into the estimation of $R_u$. This is because such a $\lambda$ would assign larger weights to recent regressors and smaller weights to regressors in the remote past. In this way, the filter will be endowed with a tracking mechanism that enables it to forget data in the remote past and to give more relevance to recent data so that changes in $R_u$ can be better tracked by the resulting algorithm.

We further assume that the step-size in (14.1) is chosen as
$$\mu(i) = 1/(i+1)$$

whereas the regularization factor is chosen as

$$\epsilon(i) = \lambda^{i+1}\epsilon/(i+1)$$

for $i \geq 0$ and for some small positive scalar $\epsilon$. This choice for $\epsilon(i)$ is such that regularization disappears as time progresses. With these approximations and choices, the regularized Newton's recursion (14.1) becomes

$$w_i = w_{i-1} + \left[\lambda^{i+1}\epsilon I + \sum_{j=0}^{i}\lambda^{i-j} u_j^* u_j\right]^{-1} u_i^*\left[d(i) - u_i w_{i-1}\right] \qquad (14.2)$$

This recursion is inconvenient in its present form since it requires, at each time instant $i$, that all previous and present data be combined to form the matrix

$$\Phi_i \triangleq \lambda^{i+1}\epsilon I + \sum_{j=0}^{i}\lambda^{i-j} u_j^* u_j \qquad (14.3)$$

which then needs to be inverted. These two complications (of data storing and matrix inversion) can be alleviated as follows. Observe from the definition of $\Phi_i$ that it satisfies the recursion

$$\Phi_i = \lambda\Phi_{i-1} + u_i^* u_i, \qquad \Phi_{-1} = \epsilon I \qquad (14.4)$$

Let $P_i = \Phi_i^{-1}$. Then applying the matrix inversion formula (5.4) to (14.4) gives

$$P_i = \lambda^{-1}\left[P_{i-1} - \frac{\lambda^{-1}P_{i-1}u_i^* u_i P_{i-1}}{1 + \lambda^{-1}u_i P_{i-1}u_i^*}\right], \qquad P_{-1} = \epsilon^{-1} I \qquad (14.5)$$

This recursion shows that the update from $P_{i-1}$ to $P_i$ requires only knowledge of the most recent regressor $u_i$. In this way, at each time instant $i$, the algorithm only needs to have access to the data $\{w_{i-1}, d(i), u_i, P_{i-1}\}$ in order to determine $\{w_i, P_i\}$. The matrix $P_{i-1}$ essentially summarizes the information from all previous regressors.


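Algorithm 14.1 (RLS algorithm) Consider a zero-mean random variable $d$ with realizations $\{d(0), d(1), \ldots\}$, and a zero-mean random row vector $u$ with realizations $\{u_0, u_1, \ldots\}$. The weight vector $w^o$ that solves

$$\min_{w} \; E\,|d - uw|^2$$

can be approximated iteratively via the recursion

$$P_i = \lambda^{-1}\left[P_{i-1} - \frac{\lambda^{-1}P_{i-1}u_i^* u_i P_{i-1}}{1 + \lambda^{-1}u_i P_{i-1}u_i^*}\right]$$

$$w_i = w_{i-1} + P_i u_i^*\left[d(i) - u_i w_{i-1}\right], \quad i \geq 0$$

with initial condition $P_{-1} = \epsilon^{-1} I$ and where $0 \ll \lambda \leq 1$.


TABLE 14.1 Estimated computational cost of the RLS algorithm per iteration for real-valued data in terms of the number of real multiplications, real additions, and real divisions.

Term | × | + | /
$u_i w_{i-1}$ | $M$ | $M - 1$ |
$d(i) - u_i w_{i-1}$ | | $1$ |
$\lambda^{-1}u_i^*$ | $M$ | |
$P_{i-1}(\lambda^{-1}u_i^*)$ | $M^2$ | $M(M - 1)$ |
$u_i P_{i-1}(\lambda^{-1}u_i^*)$ | $M$ | $M - 1$ |
$1 + u_i P_{i-1}(\lambda^{-1}u_i^*)$ | | $1$ |
$1/[1 + u_i P_{i-1}(\lambda^{-1}u_i^*)]$ | | | $1$
$(\lambda^{-1}u_i P_{i-1}u_i^*) \times \dfrac{1}{1 + u_i P_{i-1}(\lambda^{-1}u_i^*)}$ | $1$ | |
$(\lambda^{-1}P_{i-1}u_i^*) \times \dfrac{\lambda^{-1}u_i P_{i-1}u_i^*}{1 + \lambda^{-1}u_i P_{i-1}u_i^*}$ | $M$ | |
$P_i u_i^*$ | | $M$ |
$P_i u_i^*[d(i) - u_i w_{i-1}]$ | $M$ | |
$w_i$ | | $M$ |
TOTAL per iteration | $M^2 + 5M + 1$ | $M^2 + 3M$ | $1$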


In Chapters 29-30, we shall study RLS in greater detail and show, among other results, 
that it is in fact the exact solution of a least-squares problem (and, hence, the origin of its 
name). 
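A compact NumPy sketch of the RLS recursion follows. It assumes row regressors stored as 1-D arrays, noise-free synthetic data, and hypothetical names and parameter values; the matrix $\Phi_i$ of (14.4) is propagated alongside $P_i$ purely to check that the inverse is being tracked, which an actual implementation would not do.

```python
import numpy as np

def rls_step(w, P, u, d, lam):
    """One RLS iteration: update P_{i-1} -> P_i via (14.5), then the weights."""
    Pu = P @ np.conj(u)                           # P_{i-1} u_i^*
    denom = 1.0 + (u @ Pu) / lam                  # 1 + lam^{-1} u_i P_{i-1} u_i^*
    P_new = (P - np.outer(Pu, u @ P) / (lam * denom)) / lam
    w_new = w + P_new @ np.conj(u) * (d - u @ w)  # w_i = w_{i-1} + P_i u_i^* e(i)
    return w_new, P_new

rng = np.random.default_rng(0)
M, N, lam, eps = 4, 200, 0.99, 1e-3
w_true = rng.standard_normal(M) + 1j * rng.standard_normal(M)

w = np.zeros(M, dtype=complex)                    # w_{-1} = 0
P = np.eye(M, dtype=complex) / eps                # P_{-1} = eps^{-1} I
Phi = eps * np.eye(M, dtype=complex)              # Phi_{-1} = eps I (check only)

for i in range(N):
    u = rng.standard_normal(M) + 1j * rng.standard_normal(M)
    d = u @ w_true                                # noise-free reference (assumption)
    w, P = rls_step(w, P, u, d, lam)
    Phi = lam * Phi + np.outer(np.conj(u), u)     # recursion (14.4)

print(np.linalg.norm(w - w_true))
```

In this sketch `P` remains the inverse of `Phi` up to rounding, which is the point of (14.5): the inverse is carried forward without ever forming or inverting $\Phi_i$.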


14.2 COMPUTATIONAL COST 


The computational cost of RLS is one order of magnitude higher than that of LMS and ε-NLMS. To see this, we show in Tables 14.1 and 14.2 the estimated number of real multiplications and real additions that are required in the evaluation of specific terms for both cases of real and complex-valued data. In particular, the listing in the tables assumes that the RLS calculations are performed in the following manner:

$$P_i u_i^* = \left(P_{i-1}(\lambda^{-1}u_i^*)\right) - \left(P_{i-1}(\lambda^{-1}u_i^*)\right)\cdot\left(\lambda^{-1}u_i P_{i-1}u_i^*\right)\cdot\frac{1}{1 + \lambda^{-1}u_i P_{i-1}u_i^*}$$

$$w_i = w_{i-1} + \left(P_i u_i^*\right)\cdot\left(d(i) - (u_i w_{i-1})\right)$$

This manner of calculation is chosen only for convenience of exposition. The order by which the quantities are being computed here need not be the best one in practice; it is certainly not the only one. While other ways of carrying out the calculations may result in a slightly different computational cost, they will all lead to the same order of magnitude, namely, $O(M^2)$. Moreover, in this chapter, we are assuming that all computations are performed in infinite precision. In practice, however, LMS and RLS need to be implemented in finite precision. In this case, a reliable RLS implementation will usually require a higher precision than a similar LMS implementation. What this means is that, when comparing the computational costs of LMS and RLS in terms of the required number of additions and multiplications, these numbers should be interpreted in light of the fact that the calculations used by RLS should generally be of higher precision.


TABLE 14.2 Estimated computational cost of the RLS algorithm per iteration for complex-valued data in terms of the number of real multiplications, real additions, and real divisions.

Term | × | + | /
$u_i w_{i-1}$ | $4M$ | $4M - 2$ |
$d(i) - u_i w_{i-1}$ | | $2$ |
$\lambda^{-1}u_i^*$ | $2M$ | |
$P_{i-1}(\lambda^{-1}u_i^*)$ | $4M^2$ | $M(4M - 2)$ |
$u_i P_{i-1}(\lambda^{-1}u_i^*)$ | $4M$ | $4M - 2$ |
$1 + u_i P_{i-1}(\lambda^{-1}u_i^*)$ | | $1$ |
$1/[1 + u_i P_{i-1}(\lambda^{-1}u_i^*)]$ | | | $1$
$(\lambda^{-1}u_i P_{i-1}u_i^*) \times \dfrac{1}{1 + u_i P_{i-1}(\lambda^{-1}u_i^*)}$ | $1$ | |
$(\lambda^{-1}P_{i-1}u_i^*) \times \dfrac{\lambda^{-1}u_i P_{i-1}u_i^*}{1 + \lambda^{-1}u_i P_{i-1}u_i^*}$ | $2M$ | |
$P_i u_i^*$ | | $2M$ |
$P_i u_i^*[d(i) - u_i w_{i-1}]$ | $4M$ | $2M$ |
$w_i$ | | $2M$ |
TOTAL per iteration | $4M^2 + 16M + 1$ | $4M^2 + 12M - 1$ | $1$


Later in the book (e.g., Chapters 37-43) we shall develop more efficient variants of RLS that require the same order of computations as LMS and ε-NLMS. These variants will have the same advantage as RLS in terms of faster convergence. However, their algorithmic descriptions will be more involved than what we have seen so far for LMS, ε-NLMS, and even RLS.


Summary and Notes


This part of the book deals with the basic principles of steepest-descent methods. The key results are the following.


SUMMARY OF MAIN RESULTS


Steepest-Descent Algorithms

1. Consider a zero-mean random variable $d$ with variance $\sigma_d^2$ and a zero-mean random row vector $u$ with $R_u = E\,u^*u > 0$. The solution of the linear least-mean-squares estimation problem

$$\min_{w} \; E\,|d - uw|^2$$

is given by $w^o = R_u^{-1}R_{du}$. The cost function $J(w) = E\,|d - uw|^2$ is quadratic in $w$ and has a unique global minimum at $w^o$.

2. The optimal solution $w^o$ can be determined recursively via a steepest-descent method. Start with any initial guess $w_{-1}$ (e.g., $w_{-1} = 0$) and iterate

$$w_i = w_{i-1} + \mu\left[R_{du} - R_u w_{i-1}\right], \quad i \geq 0$$

The successive weight estimates $w_i$ are guaranteed to converge to $w^o$ as long as the step-size $\mu$ satisfies $0 < \mu < 2/\lambda_{\max}$, where $\lambda_{\max}$ denotes the maximum eigenvalue of $R_u$. Fastest convergence is attained by choosing $\mu^o = 2/(\lambda_{\max} + \lambda_{\min})$, where $\lambda_{\min}$ denotes the smallest eigenvalue of $R_u$. Moreover, the convergence of $w_i$ to $w^o$ is exponential and controlled by the modes $\{1 - \mu\lambda_k\}$ or by the time constants $\{-1/(2\ln|1 - \mu\lambda_k|)\}$.

3. We can also employ a steepest-descent method with a time-dependent step-size to estimate $w^o$. Start with any initial guess $w_{-1}$ (e.g., $w_{-1} = 0$) and iterate

$$w_i = w_{i-1} + \mu(i)\left[R_{du} - R_u w_{i-1}\right], \quad i \geq 0$$

The successive weight estimates $w_i$ are guaranteed to converge to $w^o$ if the step-size sequence satisfies $\mu(i) \to 0$ and $\sum_{i=0}^{\infty}\mu(i) = \infty$. Other conditions on the step-size can still guarantee convergence. Fastest convergence is attained by choosing

$$\mu^o(i) = \frac{\tilde{w}_{i-1}^* R_u^2\,\tilde{w}_{i-1}}{\tilde{w}_{i-1}^* R_u^3\,\tilde{w}_{i-1}}$$

where $\tilde{w}_{i-1} = w^o - w_{i-1}$ denotes the weight-error vector.

4. The learning curve of a steepest-descent method is defined as $J(i) = E\,|d - uw_{i-1}|^2$. In other words, the value of the learning curve at a particular iteration $i$ is a measure of the cost that would result if we freeze the weight estimate at the value obtained at the prior iteration.

5. The contours of the quadratic cost function $J(w) = E\,|d - uw|^2$ are elliptic curves centered at $w^o$ and whose principal axes are the eigenvectors of $R_u$ shifted to $w^o$.

6. The optimal solution $w^o$ can also be determined by employing a Newton-type recursion or a Levenberg-Marquardt method,

$$w_i = w_{i-1} + \mu\left[\epsilon I + R_u\right]^{-1}\left[R_{du} - R_u w_{i-1}\right], \quad i \geq 0, \quad w_{-1} = \text{initial guess}$$

where $\epsilon$ is a small positive number.

7. For more general cost functions $J(w)$ that are not necessarily quadratic in $w$, steepest-descent methods can be used to attempt to estimate a minimizing argument of $J(w)$ as follows:

$$w_i = w_{i-1} - \mu\left[\nabla_w J(w_{i-1})\right]^*$$

with sufficiently small step-sizes $\mu$, or even

$$w_i = w_{i-1} - \mu\left[\epsilon I + \nabla_w^2 J(w_{i-1})\right]^{-1}\left[\nabla_w J(w_{i-1})\right]^*$$

in terms of the gradient vector and the Hessian matrix of the cost function. In these more general cases, there is no guarantee beforehand that the weight estimates will converge to a global minimum of the cost function.
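The step-size conditions for the steepest-descent recursion with a constant step-size can be illustrated with a small numerical sketch. The covariance matrix, the optimal weight vector, and the function name below are hypothetical choices made only for this example.

```python
import numpy as np

def steepest_descent(R_u, R_du, mu, n_iter):
    """Iterate w_i = w_{i-1} + mu [R_du - R_u w_{i-1}] starting from w_{-1} = 0."""
    w = np.zeros(R_u.shape[0])
    for _ in range(n_iter):
        w = w + mu * (R_du - R_u @ w)
    return w

R_u = np.array([[2.0, 0.5], [0.5, 1.0]])   # illustrative positive-definite R_u
w_o = np.array([1.0, -1.0])
R_du = R_u @ w_o                           # so that w_o solves the normal equations

lam = np.linalg.eigvalsh(R_u)              # eigenvalues in ascending order
mu_opt = 2.0 / (lam[-1] + lam[0])          # step-size giving fastest convergence
mu_bad = 1.1 * 2.0 / lam[-1]               # violates mu < 2 / lambda_max

w_good = steepest_descent(R_u, R_du, mu_opt, 200)
w_bad = steepest_descent(R_u, R_du, mu_bad, 200)
print(np.linalg.norm(w_good - w_o), np.linalg.norm(w_bad - w_o))
```

With `mu_opt` the iterates settle at `w_o`, whereas with `mu_bad` the mode $1 - \mu\lambda_{\max}$ has magnitude larger than one and the error grows without bound.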


Stochastic-Gradient Algorithms 

This part also describes the procedure for developing stochastic-gradient approximations from steepest- 
descent methods. The key step is to replace true gradient vectors and Hessian matrices with instan- 
taneous approximations. These substitutions achieve three objectives: 


1. They free the designer from the need to know beforehand the underlying signal statistics. 


2. They equip stochastic-gradient algorithms (and the corresponding adaptive filters) with a 
learning mechanism that allows them to learn the statistics of the underlying signals. Dif- 
ferent algorithms learn at different rates and with different accuracies. 


3. They also equip stochastic-gradient algorithms with a tracking mechanism that allows them 
to track variations in the signal statistics. Again, different algorithms track at different rates 
and with different degrees of success. 


We described several stochastic gradient methods and commented on some of their properties: 


a) LMS and ε-NLMS are the most widely used adaptive filters in current practice due to their simplicity, robustness, and low computational complexity. The ε-NLMS algorithm has the advantage of relying on a search direction that is essentially insensitive to the norm of the regression vector. This property is useful when dealing with signals that undergo periods of activity and pause (e.g., speech signals).


b) The RLS algorithm is an order of magnitude costlier than LMS-type algorithms, requiring 
O(M?) vs. O(M) operations per iteration. However, RLS converges significantly faster than 
LMS. Several chapters in this book are devoted to RLS (e.g., Part VII (Least-Squares Methods) 
through Part X (Lattice Filters) where several efficient implementations of RLS are described). 


c) Although LMS and RLS have been motivated in this chapter as stochastic-gradient algorithms, 
they can also be derived as exact solutions to well-defined optimization problems. These 
problems are described at length in later stages of the book (e.g., Chapters 30 and 45). In 
the case of RLS, the algorithm will be shown to be the recursive solution of a regularized 
least-squares problem, while LMS will be shown to be the recursive solution of an indefinite 
least-squares problem. 


d) Several other LMS-type algorithms are described in the chapter, including sign-error LMS, 
leaky-LMS, LMF, LMMN, CMA, RCA, and MMA. As will become clear from the analyses in 
Parts IV (Mean-Square Performance) and V (Transient Performance), and from the problems 
therein, some of these algorithms can exhibit superior performance to LMS for some signal 
statistics. In addition, CMA, RCA, and MMA are particularly suited for applications where 
reference signals are absent (e.g., in blind equalization applications). 


BIBLIOGRAPHIC NOTES 


Iterative methods. There is a huge amount of literature on iterative methods for solving linear systems of equations (see, e.g., Faddeev and Faddeeva (1963), Traub (1965), and Golub and Van Loan (1996)) and also for solving linear and nonlinear optimization problems (see, e.g., Wilde (1964), Wilde and Beightler (1967), Hestenes (1975), and Fletcher (1987)). The steepest-descent and Newton schemes that we described in this chapter are special cases of such iterative methods, when applied to the solution of linear equations or to the minimization of quadratic cost functions. In order to illustrate this connection, we show below how the steepest-descent recursion (8.20) can be alternatively motivated as an iterative method for the solution of the normal equations $R_u w^o = R_{du}$.


Linear equations. Consider an arbitrary linear system of equations $Ax = b$ and assume we can express $A$ as $A = Q - N$ for two matrices $Q$ and $N$, chosen such that $Q$ is easily invertible and the eigenvalues of $Q^{-1}N$ are all inside the unit circle. Then the equality $Ax = b$ can be written as $Qx = Nx + b$, which leads to $x = Q^{-1}Nx + Q^{-1}b$. An iterative method for finding $x$ then replaces this equation by the recursion

$$x_i = Q^{-1}N x_{i-1} + Q^{-1}b, \quad i \geq 0$$

whose iterates $\{x_i\}$ would converge to $x$, no matter what the initial condition $x_{-1}$ is, since all the eigenvalues of $Q^{-1}N$ lie inside the unit circle.
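As an illustration of the splitting construction, the sketch below uses one common choice that is not discussed in the text: $Q$ is taken as the diagonal part of $A$ (the Jacobi splitting), which is trivially invertible and yields a convergent iteration for the diagonally dominant matrix used here. The matrix, the right-hand side, and the function name are hypothetical.

```python
import numpy as np

def splitting_iteration(Q, N, b, n_iter, x0=None):
    """Iterate x_i = Q^{-1} N x_{i-1} + Q^{-1} b for the splitting A = Q - N."""
    x = np.zeros_like(b) if x0 is None else x0
    for _ in range(n_iter):
        x = np.linalg.solve(Q, N @ x + b)
    return x

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 5.0, 2.0],
              [0.0, 2.0, 6.0]])
b = np.array([1.0, 2.0, 3.0])

Q = np.diag(np.diag(A))        # Jacobi choice: diagonal part of A (an assumption)
N = Q - A                      # so that A = Q - N

# Spectral radius of Q^{-1} N; the iteration converges when it is below one.
rho = max(abs(np.linalg.eigvals(np.linalg.solve(Q, N))))
x = splitting_iteration(Q, N, b, 200)
print(rho, np.linalg.norm(A @ x - b))
```

The same routine reproduces the steepest-descent recursion when `Q` is chosen as a scaled identity matrix, as described next in the text.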
To apply this construction to $R_u w^o = R_{du}$, we proceed as follows. We first express $R_u$ as

$$R_u = \underbrace{\mu^{-1}I}_{Q} - \underbrace{\left(\mu^{-1}I - R_u\right)}_{N} \triangleq Q - N$$

for some positive constant $\mu$ to be chosen, and for matrices $\{Q, N\}$ defined as above. Then the normal equations can be rewritten as

$$Qw^o = Nw^o + R_{du}$$

or, equivalently, as

$$\mu^{-1}w^o = \left(\mu^{-1}I - R_u\right)w^o + R_{du}$$

which simplifies to

$$w^o = w^o + \mu\left[R_{du} - R_u w^o\right]$$

We can now replace this equality by the recursion

$$w_i = w_{i-1} + \mu\left[R_{du} - R_u w_{i-1}\right]$$

and choose $\mu$ such that the eigenvalues of $Q^{-1}N = I - \mu R_u$ all lie inside the unit disc. This is of course the same steepest-descent method that we encountered before in (8.20), with the condition $\mu < 2/\lambda_{\max}$ guaranteeing a stable matrix $Q^{-1}N$.


Stochastic approximation. The idea of using iterative procedures that are based on sample realizations in order to simultaneously approximate actual expectations and minimize a certain cost function is at the core of what is known as stochastic approximation theory. The theory can be succinctly described as follows. Assume real-valued data and consider the cost function $J(w) = E\,[f(x, w)]$, for some function $f$ and stochastic data $x$. The stochastic-approximation method for minimizing $J(w)$ over $w$ takes the form (see, e.g., Tsypkin (1971, p. 47)):

$$w_i = w_{i-1} - \mu(i)\,\nabla_w\left[f(x_i, w_{i-1})\right]$$

with $f(\cdot, \cdot)$ evaluated at a realization for $x$ at iteration $i$, and where $\nabla_w f$ denotes the gradient vector of $f$ with respect to $w$. Moreover, $\mu(i)$ is a step-size sequence, possibly matrix-valued and also time-variant. It is clear that such constructions have a close relation with adaptive filter design. For example, in the adaptive filtering context, we usually have $f(x, w) = (d - uw)^2$ with $\{d, u\}$ playing the role of $x$. Then, up to a factor of 2 that can be absorbed into the step-size,

$$\nabla_w\left[f(x_i, w_{i-1})\right] = -u_i^{\mathsf{T}}\left[d(i) - u_i w_{i-1}\right]$$

in terms of realizations $\{d(i), u_i\}$ for $\{d, u\}$; the LMS filter would follow by fixing the step-size at a constant value.
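The stochastic-approximation recursion with a constant step-size can be sketched for the real-valued case as follows. The data model (white Gaussian regressors, a linear reference with small additive noise), the step-size, and the names are assumptions made for this illustration only.

```python
import numpy as np

def lms(us, ds, mu):
    """Real-valued LMS: w_i = w_{i-1} + mu u_i^T [d(i) - u_i w_{i-1}]."""
    w = np.zeros(us.shape[1])               # initial guess w_{-1} = 0
    for u, d in zip(us, ds):
        w = w + mu * u * (d - u @ w)        # instantaneous-gradient step
    return w

rng = np.random.default_rng(0)
M, N = 4, 5000
w_o = rng.standard_normal(M)                        # unknown model (assumption)
us = rng.standard_normal((N, M))                    # realizations u_i as rows
ds = us @ w_o + 0.01 * rng.standard_normal(N)       # realizations d(i)

w_hat = lms(us, ds, mu=0.01)
print(np.linalg.norm(w_hat - w_o))
```

Only realizations $\{d(i), u_i\}$ enter the recursion; no knowledge of $R_u$ or $R_{du}$ is needed, which is the practical appeal of the construction.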


According to Tsypkin (1971, p. 70), the pioneering work in the field of stochastic approxima- 
tion is that of Robbins and Monro (1951). Although their recursive procedure was a variation of a 
scheme developed two decades earlier by von Mises and Pollaczek-Geiringer (1929), the work by 
Robbins and Monro generated tremendous interest and led to many subsequent studies and exten- 
sions. While Robbins and Monro’s (1951) work dealt primarily with a scalar weight, Blum (1954) 
and Schmetterer (1961) extended the procedure to weight vectors. A description of these develop- 
ments can be found in the book by Wetherhill (1966). 


LMS adaptation. During the 1950s, stochastic-approximation theory did not receive much atten- 
tion in the engineering community until the landmark work by Widrow and Hoff (1960) in which 
they developed the real form of the LMS algorithm; the complex form of the filter appeared later in 
Widrow et al. (1975). The LMS filter is considered, in many respects, to be the birthmark of modern 
adaptive filter theory. Since its inception, the algorithm has been examined and scrutinized from all 
angles with remarkable resilience to the test of time. Very few algorithms in estimation and filtering 
theories have found so much success and have been used in so many widespread areas. 

Besides LMS, there have also been other works on adaptive structures in the early 1960s. One 
such example is an adaptive filter that minimizes the mean-square error between an input signal 
and a reference signal developed by Gabor, Wilby and Woodcock (1961); their filter is described in 
Tsypkin (1971, p. 156). This latter reference also contains on pages 172-173 commentaries on works 
on adaptation and learning during the early sixties, including a description of an adaptive filter by 
Sefl (1960) that is the continuous-time counterpart of LMS; it employs a differential update equation
of the form

$$\frac{dw(t)}{dt} = \mu(t)\, u^{\mathsf T}(t) [d(t) - u(t) w(t)]$$
with continuous-time vector variables $\{w(t), u(t)\}$ and scalar variables $\{d(t), \mu(t)\}$.

Other noteworthy works on adaptive structures in the 1960s are those by Applebaum (1966) and 
Widrow et al. (1967) on adaptive antenna arrays. In Applebaum (1966), an adaptive algorithm is 
derived that is based on maximizing the signal-to-noise ratio, while Widrow et al. (1967) focus on 
mean-square error performance and use the LMS algorithm. 


NLMS adaptation. The NLMS filter was independently proposed by Nagumo and Noda (1967) 
and Albert and Gardner (1967); although the actual terminology of normalized least-mean-squares 
algorithm seems to be due to Bitmead and Anderson (1980). A simplified version of the least-perturbation
property (11.12) for NLMS was studied by Goodwin and Sin (1984) assuming $\mu = 1$
and $\epsilon = 0$, in which case the constraint in (11.12) simplifies to $r(i) = 0$. In Nitzberg (1985),
an interesting connection is made between NLMS and LMS whereby it is shown that the NLMS 
algorithm can be obtained from applying LMS repeatedly for every new sample of data. 


LMS variations. Since its creation, several modifications have been proposed to the original LMS
formulation, such as replacing the error signal or the regressor by their signed versions for the sake of
computational simplicity. These ideas seem to have been first suggested by Lucky (1965) in the context
of adaptive equalization. Other early works are those by Gersho (1969, 1984), Hirsch and Wolf
(1970), Claasen and Mecklenbräuker (1981), and Duttweiler (1982). These investigations resulted in
the sign-error LMS form (Alg. 12.1), the sign-regressor LMS form (see Prob. V.25), and the sign-sign
LMS form where both the error signal and the regressor are replaced by their signs, i.e., for real data,

$$w_i = w_{i-1} + \mu\, \mathrm{sign}[u_i^{\mathsf T}]\, \mathrm{sign}[d(i) - u_i w_{i-1}] \qquad \text{(sign-sign LMS)}$$

Unfortunately, however, the sign-error and sign-sign LMS algorithms converge much more slowly than
LMS. In Duttweiler (1982), it was further suggested that in calculating the filter update of LMS, the
quantity $e(i)$ and the entries of $u_i$ could be quantized to the nearest power-of-two values with the
resulting algorithm still having a similar performance to that of LMS.

One variant that has improved convergence performance over sign-error LMS is the dual-sign 
LMS algorithm (see, e.g., Sethares and Johnson (1989) and Mathews (1991a)). In this variant, rather 
than only replace the error signal by its sign, the magnitude of the error is also examined and if it is 
deemed to be large, then the sign of the error is scaled up by a constant larger than one, e.g., for real 
data, 

$$w_i = w_{i-1} + \mu \gamma\, u_i^{\mathsf T}\, \mathrm{sign}[e(i)], \qquad e(i) = d(i) - u_i w_{i-1} \qquad \text{(dual-sign LMS)}$$

where $\gamma$ is a positive constant chosen as follows:

$$\gamma = \begin{cases} \text{a power of two larger than one} & \text{if } |e(i)| > \text{threshold} \\ 1 & \text{if } |e(i)| \le \text{threshold} \end{cases}$$

Large values of $\gamma$ tend to increase the convergence speed at the expense of performance degradation.
On the other hand, a large threshold tends to improve performance at the expense of slower
convergence.

A second variant of sign-error LMS is the power-of-two error LMS algorithm (see Xue and Liu 
(1986) and Eweda (1992)). In this case, if the magnitude of the error is larger than one, then it 
is replaced by its sign. Otherwise, its value is quantized to the nearest power-of-two-value in the 
following manner: 


$$w_i = w_{i-1} + \mu \gamma\, u_i^{\mathsf T}\, \mathrm{sign}[e(i)], \qquad e(i) = d(i) - u_i w_{i-1} \qquad \text{(power-of-two error LMS)}$$

where $\gamma$ is again a positive number chosen as follows. Let $B$ denote a desired number of bits
(excluding the sign bit). Then

$$\gamma = \begin{cases} 1 & \text{if } |e(i)| \ge 1 \\ 2^{\mathrm{fl}[\log_2 |e(i)|]} & \text{if } 2^{-B+1} \le |e(i)| < 1 \\ 2^{-B} & \text{if } |e(i)| < 2^{-B+1} \end{cases}$$

where the so-called floor function, $\mathrm{fl}[x]$, returns the largest integer that does not exceed its argument.
The choice of $\gamma$ as $2^{-B}$, when the error signal is small, guarantees that the filter will not stop
adapting. Alternatively, we could have set the value of $\gamma$ to zero when $|e(i)| < 2^{-B+1}$. In this
case, however, the filter update will stop when the error becomes small.
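The two rules for choosing the scaling factor in the dual-sign and power-of-two error variants can be sketched as follows for real data. This is only an illustrative sketch: the function names, the default value `gamma_large=4.0`, and the treatment of the boundary cases $|e(i)| = 1$ and $|e(i)| = 2^{-B+1}$ are assumptions of this example.

```python
import math

def sign(x):
    """Real sign function with sign(0) = 0."""
    return (x > 0) - (x < 0)

def dual_sign_gamma(e, threshold, gamma_large=4.0):
    """Dual-sign rule: a power of two larger than one for large errors, else 1."""
    return gamma_large if abs(e) > threshold else 1.0

def power_of_two_gamma(e, num_bits):
    """Power-of-two rule with num_bits bits (excluding the sign bit)."""
    magnitude = abs(e)
    if magnitude >= 1.0:
        return 1.0
    if magnitude < 2.0 ** (-num_bits + 1):
        return 2.0 ** (-num_bits)          # keeps the filter adapting for small errors
    return 2.0 ** math.floor(math.log2(magnitude))

def sign_error_step(w, u, d, mu, gamma_fn):
    """One update w_i = w_{i-1} + mu * gamma * u_i^T * sign[e(i)]."""
    e = d - sum(uk * wk for uk, wk in zip(u, w))
    scale = mu * gamma_fn(e) * sign(e)
    return [wk + scale * uk for wk, uk in zip(w, u)]
```

For instance, `sign_error_step(w, u, d, 0.01, lambda e: power_of_two_gamma(e, 4))` performs one power-of-two error update with four bits, while passing `lambda e: 1.0` recovers plain sign-error LMS.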


Affine projection algorithms. Although LMS and NLMS are among the most widely used adap- 
tive filters due to their computational simplicity and ease of implementation, colored input signals 
can deteriorate their convergence speed — see, e.g., App. 9.D of Sayed (2003). To address this prob- 
lem, Ozeki and Umeda (1984) developed the basic form of the affine projection algorithm (APA). 
Since then, many variants of APA (also known as data-reusing algorithms) have been devised from 
different perspectives such as the partial-rank algorithm (PRA) by Kratzer and Morgan (1985), the 
generalized NLMS algorithm by Morgan and Kratzer (1996), the decorrelating algorithm by Rupp 
(1998a), the APA with orthogonal correction factors (referred to as NLMS-OCF by Sankaran and
Beex (1997,2000)), and the binormalized data-reusing LMS algorithm (which corresponds to the 
special case K = 2 of APA — see Apolinário, Campos, and Diniz (2000)). An early analysis of 
data-reusing algorithms appears in Roy and Shynk (1989). A more recent study of an APA family 
appears in Shin and Sayed (2004). For applications of APA to acoustic echo cancellation, see the 
book by Benesty et al. (2001).


Variable step-size LMS. As we are going to see in Chapters 16 and 24, the step-size in LMS adaptation
controls the rate of convergence and the steady-state performance of the filter. In order to meet
the conflicting requirements of fast convergence and good steady-state performance, the step-size 
needs to be controlled. Various schemes for controlling the step-size of LMS have been proposed 
in the literature, for instance, by Kwong and Johnston (1992), Mathews and Xie (1993), Aboul- 
nasr and Mayyas (1997), Pazaitis and Constantinides (1999), and Shin, Sayed, and Song (2004a). 
Kwong and Johnston (1992) use squared instantaneous errors, while Aboulnasr and Mayyas (1997) 
use the squared autocorrelation of errors at adjacent times, Pazaitis and Constantinides (1999) adopt 
the fourth-order cumulant of the instantaneous error, and Shin, Sayed, and Song (2004a) attempt to 
maximize the decrease in the weight-error-vector energy. The step-sizes in these variants are evalu- 
ated in the manner shown in Table 14.3 (for the case of real data). 
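As a concrete instance of such a rule, the first entry of Table 14.3 (the squared-instantaneous-error scheme of Kwong and Johnston) can be sketched as below. The default values of `alpha` and `gamma`, and the clipping of the step-size to the interval `[mu_min, mu_max]`, are assumptions of this sketch and are not part of the table entry.

```python
def vss_step_size(mu_prev, e, alpha=0.97, gamma=1e-3, mu_min=1e-4, mu_max=0.1):
    """Step-size recursion mu(i) = alpha * mu(i-1) + gamma * e(i)^2.

    The result is clipped to [mu_min, mu_max]; the clipping is an assumed
    safeguard added for this illustration.
    """
    mu = alpha * mu_prev + gamma * e * e
    return min(max(mu, mu_min), mu_max)
```

A large error pushes the step-size up (faster convergence), while a run of small errors lets it decay geometrically toward the lower bound (better steady-state performance).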


Blind algorithms. Blind algorithms compensate for the lack of an explicit reference signal. They 
are pointedly referred to as adaptive algorithms for restoring signal properties by Treichler, Johnson, 
and Larimore (2001). One of the first “blind” algorithms in the context of signal processing is that of 
Griffiths (1967), who replaced the product $u_i^* d(i)$ in the LMS update (10.10) by $R_{du}$, in which case


TABLE 14.3 Variable-step-size LMS implementations.

VSS-LMS, Kwong and Johnston (1992):
    $\mu(i) = \alpha \mu(i-1) + \gamma e^2(i)$

RVS-LMS, Aboulnasr and Mayyas (1997):
    $\mu(i) = \alpha \mu(i-1) + \gamma p^2(i)$
    $p(i) = \beta p(i-1) + (1-\beta) e(i) e(i-1)$

KVS-LMS, Pazaitis and Constantinides (1999):
    $\mu(i) = \mu_{\max} \left( 1 - e^{-\alpha C_4^e(i)} \right)$
    $C_4^e(i) = f(i) - 3 p^2(i)$
    $f(i) = \beta f(i-1) + (1-\beta) e^4(i)$
    $p(i) = \beta p(i-1) + (1-\beta) e^2(i)$

VSS-NLMS, Shin, Sayed, and Song (2004a):
    $\mu(i) = \mu_{\max} \dfrac{\|p_i\|^2}{\|p_i\|^2 + c}$, with $c \propto 1/\mathrm{SNR}$
    $p_i = \beta p_{i-1} + (1-\beta) \dfrac{u_i^{\mathsf T}}{\|u_i\|^2} e(i)$



the update for $w_i$ would become
$$w_i = w_{i-1} + \mu R_{du} - \mu u_i^* z(i)$$

where $z(i) = u_i w_{i-1}$ is the output of the filter. In this implementation, the absence of a reference
signal is compensated for by using $R_{du}$, and there are applications where $R_{du}$ can be computed
beforehand; see Treichler, Johnson, and Larimore (2001) for an example in the context of interference
suppression.

The first truly blind algorithm is due to Sato (1975), who developed his algorithm for blind equal- 
ization purposes. In Sato’s algorithm, all signals are real-valued and the data to be recovered belongs 
to a BPSK constellation; its update equation has the form 


$$w_i = w_{i-1} + \mu u_i^{\mathsf T} \left[ \gamma\, \mathrm{sign}[z(i)] - z(i) \right], \qquad z(i) = u_i w_{i-1}$$


which is seen to be a special case of CMA1-2 and RCA for real data. Actually, RCA extends Sato's
algorithm to the complex domain (see Prob. III.34). Although simple to implement, the Sato and
RCA methods tend to face convergence difficulties. 

The more general constant-modulus algorithms described in Algs. 12.5 and 12.6 were developed 
by Godard (1980) and Treichler and Agee (1983). These authors arrived at the same algorithms, and 
the only minor difference between their formulations is with regard to the interpretation of the constant
$\gamma$. The motivation of Treichler and Agee (1983) was to equalize constant-envelope signals. For
this reason, they chose the constant $\gamma$ as the value of the constant amplitude. Interestingly, however,
constant-modulus algorithms can still function properly even if the underlying data do not possess
the constant-modulus property (e.g., data arising from a QAM constellation). This fact was noted
earlier by Godard (1980), whose motivation was to perform blind equalization on such disperse
constellations. For this reason, the constant $\gamma$ in Godard (1980) was chosen as some statistical measure
of the dispersion in the data (see Probs. IV.16 and IV.18 for further explanation). Compared with 
RCA, the CMA methods are more reliable. Nevertheless, their implementations are more complex 
since they tend to require the use of rotators at the output of the equalizers (see Treichler, Larimore, 
and Harp (1998)). 

Since these earlier contributions, much subsequent work has been done on blind algorithms, e.g., 
by Macchi and Eweda (1984), Benveniste and Goursat (1984), Foschini (1985), Picchi and Prati 
(1987), and Shalvi and Weinstein (1990). In Picchi and Prati (1987) a “stop-and-go” algorithm 
was introduced that employs a flag to halt adaptation if the error signal is deemed unreliable; the 
algorithm is examined in Computer Project III.4. In Yang, Werner, and Dumont (1997), a multi- 
modulus algorithm (MMA) was introduced that is suitable for two-dimensional modulation schemes 
(see Prob. III.35); this algorithm was studied in the context of broadband access systems in Werner 
et al. (1999).

The article by Johnson (1998) provides a survey of the developments in the area of blind algorithms
until the late 1990s, including applications to fractionally-spaced equalization (see also Mai
and Sayed (2000)). Constant modulus algorithms have also been used for adaptive beamforming 
(e.g., Keerthi, Mathur, and Shynk (1998)) and for interference cancellation (e.g., Kwon, Un, and Lee 
(1992)). 


Adaptive equalization. In the concluding remarks of Chapter 6, we commented on the early his- 
tory of channel equalization and how the mean-square error criterion was proposed for this purpose 
by Widrow (1966), Gersho (1969), and Proakis and Miller (1969). In this chapter, we described how 
channel equalizers could be designed adaptively, as opposed to the closed-form solution methods 
of Secs. 5.4 and 6.4. Historically, the LMS algorithm was developed at the right time, in the early 
1960s, right when interest in equalization was starting to build up (Lucky (1965)). Soon afterwards, 
the use of adaptive filters for equalization purposes was studied in greater detail by Gersho (1969), 
Proakis and Miller (1969), Ungerboeck (1972), and Salz (1973). In these adaptive implementations, 
an equalizer would have two modes of operation: a training mode (Gersho (1969)) whereby data that
are known to the receiver and the transmitter are used to train the equalizer, and a decision-directed 
mode (Salz (1973)), whereby decisions are used to train the equalizer. 


Problems and Computer Projects 


PROBLEMS 


Problem III.1 (Multiple step-sizes) Refer to expression (8.18) and choose the matrix $B$ as $B =
\mathrm{diag}\{b(1), b(2), \ldots, b(M)\}$ with $b(i) > 0$. In this case, the recursion (8.20) is replaced by $w_i =
w_{i-1} + \mu B [R_{du} - R_u w_{i-1}]$. This scheme associates one step-size with each entry of the weight
vector $w$. Follow the discussion in Sec. 8.2 and derive a necessary and sufficient condition on $\mu$ in
order to guarantee convergence of $w_i$ to $w^o = R_u^{-1} R_{du}$.


Problem III.2 (Product of infinitely many numbers) Consider a scalar recursion of the form
$x(i) = a(i) x(i-1)$ for $i \ge 0$, and assume $a(i) = e^{-1/(i+1)^2}$.
(a) Verify that $0 < a(i) < 1$ for all finite $i$.

(b) Let $p(i) = \prod_{j=0}^{i} a(j)$. Show that $p(i)$ converges to $e^{-\pi^2/6}$, which is a finite positive number.

Hint: The series $\sum_{j=1}^{\infty} (1/j^2)$ converges to $\pi^2/6$.
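A quick numerical check of part (b) is sketched below, assuming the coefficients are $a(i) = e^{-1/(i+1)^2}$ for $i \ge 0$ (this reading of the problem statement is an assumption, chosen to be consistent with the hint). The name `partial_product` is hypothetical.

```python
import math

def partial_product(num_terms):
    """Product of a(0), ..., a(num_terms - 1) with a(i) = exp(-1 / (i + 1)^2)."""
    p = 1.0
    for i in range(num_terms):
        p *= math.exp(-1.0 / (i + 1) ** 2)
    return p

limit = math.exp(-math.pi ** 2 / 6)   # claimed limit of the infinite product
```

The partial products decrease toward `limit` from above, and the gap after $N$ terms is roughly `limit / N`, which mirrors the slow convergence of the series in the hint.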


Problem III.3 (Optimal step-size) Refer to expression (9.16) for the optimal step-size. Verify
that it is equivalent to the following:


$$\mu^o(i) = \frac{\|\nabla_w J(w_{i-1})\|^2}{[\nabla_w J(w_{i-1})]\, R_u\, [\nabla_w J(w_{i-1})]^*}$$


in terms of the squared Euclidean norm of the gradient vector in the numerator, and the weighted 
squared Euclidean norm of the same vector in the denominator. 


Problem III.4 (Convergent step-size sequence) Consider the steepest-descent algorithm (9.13)
with a time-variant step-size. Assume that $\mu(i)$ converges to a positive value, say, $\mu(i) \to \alpha > 0$ as
$i \to \infty$. Show that if $\alpha$ satisfies $\alpha < 2/\lambda_{\max}$, then $w_i$ converges to $w^o$.


Problem III.5 (Optimal step-size) Consider the optimal step-size (9.16) in the iteration-dependent
case of the steepest-descent algorithm. Show that $1/\lambda_{\max} \le \mu^o(i) \le 1/\lambda_{\min}$, where $\lambda_{\max}$ and $\lambda_{\min}$
denote the largest and smallest eigenvalues of $R_u$. Conclude that $\sum_{i=0}^{\infty} \mu^o(i) = \infty$.


Problem III.6 (Contour curves) Consider the steepest-descent algorithm (9.13) with the optimal
time-variant step-size (9.16).
(a) Show that $J(w_i) = J(w_{i-1}) - \mu^o(i)\, \tilde{w}_{i-1}^* R_u^2 \tilde{w}_{i-1}$, where $\tilde{w}_{i-1} = w^o - w_{i-1}$.
(b) Assume M = 2 (i.e. the size of w° is 2 x 1). Sketch the elliptic contours of constant 
mean-square error and explain how the optimized algorithm moves from one elliptic curve to 
another. 


Problem III.7 (Interfering signals) Refer to Prob. I.14. Suggest a steepest-descent algorithm for
estimating x; from y. 


Problem III.8 (Prediction problem) A zero-mean stationary random process $\{u(\cdot)\}$ is generated
by passing a zero-mean white sequence $\{v(\cdot)\}$ with variance $\sigma_v^2$ through a second-order
auto-regressive model, namely, $u(i) + \alpha u(i-1) + \beta u(i-2) = v(i)$, $i > -\infty$, where $\alpha$ and $\beta$ are
real numbers such that the roots of the characteristic equation $1 + \alpha z^{-1} + \beta z^{-2} = 0$ are strictly
inside the unit circle. We wish to design a second-order predictor for the process $u(\cdot)$ of the form
$\hat{u}(i) = [\, u(i-1) \;\; u(i-2) \,]\, w^o$, for some $2 \times 1$ vector $w^o$.

(a) Verify that $\{\alpha, \beta\}$ must satisfy $|\beta| < 1$ and $|\alpha| < 1 + \beta$.


(b) Define the data vector $u = [\, u(i-1) \;\; u(i-2) \,]$ and the desired signal $d = u(i)$. Let
$R_u = \mathsf{E}\, u^* u$ and $R_{du} = \mathsf{E}\, d u^*$. Show that


$$[\, R_u \mid R_{du} \,] = \frac{\sigma_v^2}{(1-\beta)[(1+\beta)^2 - \alpha^2]} \left[ \begin{array}{cc|c} 1+\beta & -\alpha & -\alpha \\ -\alpha & 1+\beta & \alpha^2 - \beta - \beta^2 \end{array} \right]$$


Establish that $(1-\beta)[(1+\beta)^2 - \alpha^2] > 0$.


(c) Show that the optimal weight vector $w^o$ is given by $w^o = \mathrm{col}\{-\alpha, -\beta\}$. Could you have
guessed this answer more directly without evaluating the product $R_u^{-1} R_{du}$?


(d) Verify that the eigenvalue spread of $R_u$ is $\rho = (\beta + 1 + |\alpha|)/(\beta + 1 - |\alpha|)$. Design a
steepest-descent algorithm that determines $w^o$ iteratively. Provide a condition on the step-size
$\mu$ in terms of $\{\alpha, \beta\}$ in order to guarantee convergence.


(e) Show that the value of the step-size that yields fastest convergence, and the resulting time-constant,
are


$$\mu^o = \frac{(1-\beta)[(1+\beta)^2 - \alpha^2]}{\sigma_v^2 (1+\beta)}, \qquad \tau = \frac{-1}{2 \ln\left( |\alpha|/(\beta+1) \right)}$$
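The claims of parts (b) through (e) can be checked numerically with a short sketch. The covariance expressions used below follow from the Yule-Walker equations of the second-order model; the particular values $\alpha = 0.5$, $\beta = 0.3$, $\sigma_v^2 = 1$ and the function names are assumptions made for this illustration.

```python
def ar2_statistics(alpha, beta, sigma_v2=1.0):
    """Return (R_u, R_du) for u(i) + alpha*u(i-1) + beta*u(i-2) = v(i)."""
    scale = sigma_v2 / ((1.0 - beta) * ((1.0 + beta) ** 2 - alpha ** 2))
    r0 = scale * (1.0 + beta)          # E u(i)^2
    r1 = -scale * alpha                # E u(i) u(i-1)
    r2 = -alpha * r1 - beta * r0       # E u(i) u(i-2), from the model recursion
    return [[r0, r1], [r1, r0]], [r1, r2]

def steepest_descent(R_u, R_du, mu, iterations):
    """Iterate w_i = w_{i-1} + mu * (R_du - R_u w_{i-1}) for a 2 x 1 weight vector."""
    w = [0.0, 0.0]
    for _ in range(iterations):
        g = [R_du[k] - R_u[k][0] * w[0] - R_u[k][1] * w[1] for k in range(2)]
        w = [w[k] + mu * g[k] for k in range(2)]
    return w

alpha, beta = 0.5, 0.3
R_u, R_du = ar2_statistics(alpha, beta)
lam_max = R_u[0][0] + abs(R_u[0][1])   # eigenvalues of a symmetric Toeplitz 2 x 2 matrix
lam_min = R_u[0][0] - abs(R_u[0][1])
mu_opt = 2.0 / (lam_max + lam_min)     # step-size for fastest convergence
w = steepest_descent(R_u, R_du, mu_opt, iterations=100)
spread = lam_max / lam_min
```

For these values the iterates should approach $\mathrm{col}\{-\alpha, -\beta\}$, and the eigenvalue spread should agree with $(\beta + 1 + |\alpha|)/(\beta + 1 - |\alpha|)$.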


Problem III.9 (Logarithm function) Establish that the logarithm function satisfies the following
properties:


(a) For any $0 \le y < 1$, it holds that $\ln(1 - y) \le -y$.
(b) There exist $0 < b < 1$ and $a > 0$ such that $-ay \le \ln(1 - y)$ for any $y \in [0, b]$.


Problem III.10 (Convergence proof) The purpose of this problem is to establish the statement
of Thm. 9.1. Let $\tilde{w}_i = w^o - w_i$ and introduce the eigen-decomposition $R_u = U \Lambda U^*$. Define the
transformed vector $x_i = U^* \tilde{w}_i$ and let $x_k(i)$ denote its $k$-th entry.


(a) Assume first that the step-size sequence is divergent and let us establish that $w_i$ converges to
$w^o$. Thus since $\lambda_k > 0$, and using the fact that $\mu(i) \to 0$, conclude that there exists a large
enough $i_o$ such that $0 < 1 - \mu(i)\lambda_k < 1$ for all $k$ and for all $i > i_o$. Use the result of part (a)
of Prob. III.9 to conclude that $\sum_{j=i_o+1}^{\infty} \ln(1 - \mu(j)\lambda_k) = -\infty$.


(b) Verify that $x_k(i)$ satisfies the recursion $x_k(i) = (1 - \mu(i)\lambda_k)\, x_k(i-1)$ and conclude that


$$\lim_{i \to \infty} \ln \frac{|x_k(i)|}{|x_k(i_o)|} = -\infty$$


and, consequently, $x_k(i) \to 0$. Remark. The argument in parts (a) and (b) shows that if the step-size
sequence is divergent then $\tilde{w}_i \to 0$. We now examine the converse statement.


(c) Let us now assume that $w_i$ converges to $w^o$ and let us establish that the step-size sequence is
divergent. Thus assume $x_k(i) \to 0$ and show that it implies $\sum_{j=i_o+1}^{\infty} \ln(1 - \mu(j)\lambda_k) = -\infty$.

(d) Assume that $i_o$ is large enough so that not only $0 < 1 - \mu(j)\lambda_k < 1$ for all $k$ and $j > i_o$, but
also $\mu(j)\lambda_k < b$. Use the result of part (b) of Prob. III.9 to conclude that


$$-a \lambda_k \sum_{j=i_o+1}^{\infty} \mu(j) \;\le\; \sum_{j=i_o+1}^{\infty} \ln(1 - \mu(j)\lambda_k) = -\infty$$


That is, conclude that the sequence $\{\mu(i)\}$ is divergent.
Remark. The above arguments are patterned along those given in Macchi (1995, pp. 65-67).


Problem III.11 (Regularized Newton's method) Consider the steepest-descent recursion of Remark 2,


$$w_i = w_{i-1} - \mu \left[ \epsilon I + \nabla_w^2 J(w_{i-1}) \right]^{-1} [\nabla_w J(w_{i-1})]^*, \qquad w_{-1} = \text{initial guess}$$


for some $\epsilon > 0$. For the quadratic cost function $J(w)$ of (8.8) we have $\nabla_w^2 J(w_{i-1}) = R_u > 0$ and
$\nabla_w J(w_{i-1}) = -(R_{du} - R_u w_{i-1})^*$.
(a) Show that a necessary and sufficient condition on $\mu$ that guarantees convergence of $w_i$ to the
minimizing argument of $J(w)$ is $0 < \mu < 2 + 2\epsilon/\lambda_{\max}$.
(b) Find the optimum step-size $\mu^o$ at which the convergence rate is maximized.


Problem III.12 (Leaky variant of steepest-descent) Consider the modified optimization problem

$$\min_w \left[ J^{\alpha}(w) \triangleq \alpha \|w\|^2 + \mathsf{E}\, |d - uw|^2 \right]$$


where $\alpha$ is a positive real number and $J^{\alpha}(w)$ is the new cost function (it is dependent on $\alpha$). In the
text we studied the case $\alpha = 0$ (see expression (8.7) for $J(w)$). The above modified cost function
penalizes the energy (or squared norm) of the vector $w$, and is therefore useful in situations where
we want to avoid solutions with potentially large norms.
(a) Show that the optimal solution is given by $w^{\alpha} = [R_u + \alpha I]^{-1} R_{du}$. Compute the resulting
minimum cost, $J^{\alpha}(w^{\alpha})$, and show that $J^{\alpha}(w^{\alpha}) \ge \sigma_d^2 - R_{ud} R_u^{-1} R_{du} = J_{\min}$, where $J_{\min}$
is the minimum cost associated with $J(w)$ (cf. (8.11)).


(b) Let $\delta w = w^{\alpha} - w^o$ denote the difference between the new solution $w^{\alpha}$ and the linear least-mean-squares
solution $w^o$ of (8.4). Show that $R_{ud}\, \delta w = J_{\min} - J^{\alpha}(w^{\alpha})$.


(c) Justify the validity of the following steepest-descent method for determining $w^{\alpha}$:


$$w_i^{\alpha} = (1 - \mu\alpha)\, w_{i-1}^{\alpha} + \mu [R_{du} - R_u w_{i-1}^{\alpha}], \qquad w_{-1}^{\alpha} = \text{initial guess}$$


Show that $w_i^{\alpha}$ converges to $w^{\alpha}$ if, and only if, $0 < \mu < 2/(\lambda_{\max} + \alpha)$. Show also that the
optimal step-size for fastest convergence is $\mu_{\alpha}^o = 2/(\lambda_{\max} + \lambda_{\min} + 2\alpha)$.


(d) Let $\mu^o$ denote the optimal step-size choice for the standard steepest-descent method (with
$\alpha = 0$, i.e., cf. (9.5)). Compare $\mu_{\alpha}^o$ and $\mu^o$.

Remark. We see from the result in part (a) that the inclusion of the term $\alpha \|w\|^2$ in the cost function $J^{\alpha}(w)$ has
the effect of modifying the input covariance matrix from $R_u$ to $R_u + \alpha I$. This can be interpreted as adding a
noise vector $v$ to $u$, with the individual entries of $v$ arising from a zero-mean white-noise process with variance
$\alpha$. This process of disturbing the input $u$ with entries from a white-noise sequence is known as dithering. Its
effect is to provide a mechanism for controlling the size of the solution vector $w^{\alpha}$; it also results in a covariance
matrix with smaller eigenvalue spread (see Prob. III.13). Its disadvantage is that the optimal solution $w^{\alpha}$ will be
distinct from the desired solution $w^o$.


Problem III.13 (One effect of dithering) Consider the same setting as in Prob. III.12. Show
that the eigenvalue spread of $R_u + \alpha I$ is smaller than the eigenvalue spread of $R_u$.
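The leaky recursion of Prob. III.12 and the eigenvalue-spread claim of Prob. III.13 can be explored with the following sketch. To keep it short, the covariance matrix is assumed diagonal, so that its eigenvalues are its diagonal entries and each weight evolves independently; the numerical values and the names are assumptions of this example.

```python
def leaky_steepest_descent(lams, r_du, alpha, mu, iterations):
    """Leaky recursion w_i = (1 - mu*alpha) w_{i-1} + mu (R_du - R_u w_{i-1})
    for a diagonal R_u = diag(lams)."""
    w = [0.0] * len(lams)
    for _ in range(iterations):
        w = [(1.0 - mu * alpha) * wk + mu * (rk - lk * wk)
             for wk, rk, lk in zip(w, r_du, lams)]
    return w

lams, r_du, alpha = [0.5, 2.0], [1.0, 1.0], 0.5        # assumed values
mu_opt = 2.0 / (max(lams) + min(lams) + 2.0 * alpha)   # fastest-convergence step-size
w_leaky = leaky_steepest_descent(lams, r_du, alpha, mu_opt, iterations=200)

spread_plain = max(lams) / min(lams)                       # spread of R_u
spread_leaky = (max(lams) + alpha) / (min(lams) + alpha)   # spread of R_u + alpha*I
```

Here the leaky solution is `[1.0, 0.4]`, whereas the unregularized solution $R_u^{-1} R_{du}$ would be `[2.0, 0.5]`, which illustrates both the shrinkage of the solution and the bias that dithering introduces.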


Problem III.14 ($l_1$-norm of complex variables) Let $x$ be a nonzero real-valued variable. Verify
that, for $x \neq 0$,


$$\frac{d|x|}{dx} = \mathrm{sign}(x) \quad \text{where} \quad \mathrm{sign}(x) = \begin{cases} +1 & x > 0 \\ -1 & x < 0 \end{cases}$$

[At $x = 0$, we shall define from now on sign(0) = 0.] Now assume $x$ is complex-valued with real
and imaginary parts denoted by $x_r$ and $x_i$, respectively. Define its $l_1$-norm as follows: $|x|_1 = |x_r| + |x_i|$.
That is, we add the absolute values of its real and imaginary parts. [The result can be interpreted as
the $l_1$-norm of the vector $\mathrm{col}\{x_r, x_i\}$.] Show that the complex gradient (cf. Chapter C) of $|x|_1$ with
respect to $x \neq 0$ is given by


$$\frac{d|x|_1}{dx} = \frac{1}{2} \left[ \mathrm{sign}(x_r) - j\, \mathrm{sign}(x_i) \right]$$




Problem III.15 (Sign-error algorithm) Consider two zero-mean random variables $d$ and $u$ where
$d$ is scalar-valued and $u$ is a row vector. We are interested in minimizing the expected value of the
$l_1$-norm of the error $e = d - uw$ (cf. Prob. III.14), i.e., $\min_w \mathsf{E}\, |e|_1$.


(a) Let $e = e_r + j e_i$. Show that the complex-gradient of the cost function $J(w) = \mathsf{E}\, |e|_1$ with
respect to $w$ is given by $\nabla_w J(w) = -\mathsf{E}\left( u\, [\mathrm{sign}(e_r) - j\, \mathrm{sign}(e_i)] \right)/2$.


(b) Conclude that a steepest-descent method can be obtained via the recursion
$$w_i = w_{i-1} + \frac{\mu}{2}\, \mathsf{E}\, u^* \left\{ \mathrm{sign}[e_r(i)] + j\, \mathrm{sign}[e_i(i)] \right\}$$


for some positive step-size $\mu$ and where $e(i) = d - u w_{i-1} = e_r(i) + j e_i(i)$.
Remark. A more compact representation can be obtained if we define the sign of a complex number as
follows:

$$\mathrm{csgn}(x) = \mathrm{sign}(x_r) + j\, \mathrm{sign}(x_i)$$
Observe that we are writing csgn instead of sign; we reserve the notation sign for the sign of a real
number. With this definition, we can rewrite the above steepest-descent recursion as


$$w_i = w_{i-1} + \mu\, \mathsf{E}\, u^*\, \mathrm{csgn}[e(i)]$$


in terms of $e(i)$ and for a scaled step-size, which we still denote by $\mu$.


Problem III.16 (Least-mean-fourth (LMF) criterion) Consider zero-mean random variables $d$
and $u$, with $u$ a row vector. An optimal weight vector $w^o$ is to be chosen by minimizing the cost
function

$$\min_w \mathsf{E}\, |d - uw|^{2L}$$


for some positive integer $L$. Observe that we are now minimizing the moment of order $2L$ of the
error signal rather than its variance, as was done in Sec. 8.1.


(a) Argue that the cost function $\mathsf{E}\, |d - uw|^{2L}$ is convex in $w$ and that, therefore, it cannot have
local minima.


(b) Assume in this part that $\{d, u, w\}$ are real and scalar-valued. Assume further that $d$ and $u$
are jointly Gaussian and that $L = 2$. Verify that the cost function in this case reduces to


$$J(w) = 3\sigma_d^4 - 12\sigma_d^2 \sigma_{du} w + 6(\sigma_d^2 \sigma_u^2 + 2\sigma_{du}^2) w^2 - 12 \sigma_u^2 \sigma_{du} w^3 + 3 \sigma_u^4 w^4$$
where $\sigma_u^2 = \mathsf{E}\, u^2$, $\sigma_d^2 = \mathsf{E}\, d^2$ and $\sigma_{du} = \mathsf{E}\, du$. Show that $w^o = \sigma_{du}/\sigma_u^2$ is a minimum of
$J(w)$. Is it a global minimum? Assume $\sigma_u^2 = 1$, $\sigma_d^2 = 1$ and $\sigma_{du} = 0.7$. Plot $J(w)$.

(c) Let $z$ be a complex-valued variable and consider the function $f(z) = |z|^{2L}$. Verify that
$df/dz = L z^* |z|^{2(L-1)}$.

(d) Let $e = d - uw$ and denote the cost function by $J(w) = \mathsf{E}\, |e|^{2L}$. Use the composition rule
of differentiation to verify that $\nabla_w J(w) = -L\, \mathsf{E}\, u e^* |e|^{2(L-1)}$. Conclude that a steepest-descent
implementation for finding a minimizing solution of $J(w)$ is given by (note that we
blended the constant $L$ into $\mu$):

$$w_i = w_{i-1} + \mu\, \mathsf{E}\, u^* e(i) |e(i)|^{2(L-1)}$$
for some step-size $\mu > 0$ and where $e(i) = d - u w_{i-1}$.


Remark. When $L = 2$, the criterion corresponds to solving a least-mean-fourth (LMF) estimation problem. The
recursion in this case would take the form


$$w_i = w_{i-1} + \mu\, \mathsf{E}\, u^* e(i) |e(i)|^2$$
Note also that when all variables are real-valued, the recursion of part (d) reduces to


$$w_i = w_{i-1} + \mu\, \mathsf{E}\, u^{\mathsf T} e^{2L-1}(i)$$


and in the LMF case it becomes


$$w_i = w_{i-1} + \mu\, \mathsf{E}\, u^{\mathsf T} e^3(i)$$
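For the numerical part of (b) in Prob. III.16, the quartic cost can be evaluated on a grid to locate its minimizer before plotting. The sketch below uses the values suggested in the problem ($\sigma_u^2 = \sigma_d^2 = 1$, $\sigma_{du} = 0.7$); the grid range and spacing, and the names `lmf_cost`, `grid`, `w_best`, are assumptions of this example.

```python
def lmf_cost(w, var_u=1.0, var_d=1.0, cov_du=0.7):
    """Fourth-moment cost E (d - u w)^4 for real, jointly Gaussian {d, u}."""
    return (3 * var_d ** 2
            - 12 * var_d * cov_du * w
            + 6 * (var_d * var_u + 2 * cov_du ** 2) * w ** 2
            - 12 * var_u * cov_du * w ** 3
            + 3 * var_u ** 2 * w ** 4)

# Evaluate the cost on a uniform grid over [-1, 2] and pick the smallest value.
grid = [-1.0 + 0.001 * k for k in range(3001)]
w_best = min(grid, key=lmf_cost)
```

The grid minimizer lands on $\sigma_{du}/\sigma_u^2 = 0.7$, which is consistent with the observation that the cost equals three times the square of the error variance and the error variance is itself minimized at that point.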
Problem III.17 (Least-mean-mixed-norm (LMMN) criterion) Consider zero-mean random variables
$d$ and $u$, where $u$ is a row vector. An optimal weight vector $w^o$ is to be chosen so as to minimize
the cost function


$$\min_w \mathsf{E} \left[ \delta |e|^2 + \tfrac{1}{2}(1 - \delta) |e|^4 \right]$$


where $e = d - uw$ and $0 \le \delta \le 1$. Observe that this cost function reduces to the least-mean-squares
and least-mean-fourth criteria at the extreme points $\delta = 1$ and $\delta = 0$, respectively. Other values of $\delta$
allow for a tradeoff between both criteria. Verify that


$$\nabla_w J(w) = -\mathsf{E}\, u \left[ \delta e^* + (1 - \delta) e^* |e|^2 \right]$$
Conclude that a steepest-descent implementation for finding a minimizing solution of $J(w)$ is given
by

$$w_i = w_{i-1} + \mu\, \mathsf{E}\, u^* e(i) \left[ \delta + (1 - \delta) |e(i)|^2 \right]$$


for some step-size $\mu > 0$ and where $e(i) = d - u w_{i-1}$. Remark. For real data, the above recursion
reduces to
$$w_i = w_{i-1} + \mu\, \mathsf{E}\, u^{\mathsf T} \left[ \delta e(i) + (1 - \delta) e^3(i) \right]$$


Problem III.18 (Constant-modulus criterion) Let $u$ be a zero-mean row vector and $w$ an unknown
column vector that we wish to determine so as to minimize the cost function $J(w) =
\mathsf{E} \left( \gamma - |uw|^2 \right)^2$, for a given constant number $\gamma$.
(a) Let $R_u = \mathsf{E}\, u^* u$. Show that
$$J(w) = \gamma^2 + w^* \left[ -2\gamma R_u + \mathsf{E} \left( u^* u |uw|^2 \right) \right] w$$


and
$$\nabla_w J(w) = -2 w^* \left( \gamma R_u - \mathsf{E} \left( u^* u |uw|^2 \right) \right)$$


Conclude that a steepest-descent algorithm for the minimization of $J(w)$ is given by


$$w_i = w_{i-1} + \mu \left( \gamma R_u w_{i-1} - \mathsf{E} \left[ u^* u w_{i-1} |u w_{i-1}|^2 \right] \right)$$


for some step-size $\mu > 0$.


(b) Assume that $w$ is two-dimensional, say with entries $w = \mathrm{col}\{\alpha, \beta\}$, and that $u$ is a circular
Gaussian random vector (cf. Sec. A.5) with $R_u = \mathrm{diag}\{1, 2\}$.


(b.1) Verify that $\mathsf{E}\, |uw|^2 = |\alpha|^2 + 2|\beta|^2$ and $\mathsf{E}\, |uw|^4 = 2(|\alpha|^2 + 2|\beta|^2)^2$. Conclude that
$J(w) = \gamma^2 + 2(|\alpha|^2 + 2|\beta|^2)^2 - 2\gamma(|\alpha|^2 + 2|\beta|^2)$. Remark. If $z$ is a zero-mean scalar
complex-valued Gaussian random variable with variance $\mathsf{E}\, |z|^2 = \sigma_z^2$, then its fourth moment is
$\mathsf{E}\, |z|^4 = 2\sigma_z^4$. If $z$ were real-valued instead, its fourth moment would be $\mathsf{E}\, z^4 = 3\sigma_z^4$. Verify
these claims.


(b.2) Conclude also that $\nabla_w J(w) = 2(-\gamma + 2|\alpha|^2 + 4|\beta|^2) [\, \alpha^* \;\; 2\beta^* \,]$. Argue that
$J(w)$ has a local maximum at the point $\alpha = \beta = 0$ and global minima at all points
lying on the ellipse $|\alpha|^2 + 2|\beta|^2 = \gamma/2$. Argue further that the minimum value of $J(w)$
is equal to $\gamma^2/2$.


(b.3) Conclude that the steepest-descent algorithm of part (a) reduces in this case to the form
$$\begin{bmatrix} \alpha(i) \\ \beta(i) \end{bmatrix} = \begin{bmatrix} \alpha(i-1) \\ \beta(i-1) \end{bmatrix} + \mu \left( \gamma - 2|\alpha(i-1)|^2 - 4|\beta(i-1)|^2 \right) \begin{bmatrix} \alpha(i-1) \\ 2\beta(i-1) \end{bmatrix}$$


and that the weight estimates are always real-valued if the initial condition is real-valued.
This problem is pursued further ahead in a computer project.

Remark. By minimizing the cost function $J(w)$ we are forcing the magnitude of $uw$ to be close to the constant
$\sqrt{\gamma}$; hence, the name constant-modulus criterion. Such cost functions are used in blind channel equalization
applications; see Computer Project III.4.
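The two-dimensional recursion of part (b.3) is easy to iterate for a real-valued initial condition. The sketch below assumes $\gamma = 1$, a step-size of 0.1, and the starting point $(0.3, 0.2)$; these values and the function name are illustrative assumptions.

```python
def cm_steepest_descent(a, b, gamma, mu, iterations):
    """Iterate the recursion of part (b.3) for real-valued weights (a, b)."""
    for _ in range(iterations):
        g = gamma - 2.0 * a * a - 4.0 * b * b        # common scalar factor
        a, b = a + mu * g * a, b + mu * g * 2.0 * b
    return a, b

gamma = 1.0
a, b = cm_steepest_descent(0.3, 0.2, gamma, mu=0.1, iterations=500)
s = a * a + 2.0 * b * b                               # should approach gamma / 2
cost = gamma ** 2 + 2.0 * s ** 2 - 2.0 * gamma * s    # should approach gamma^2 / 2
```

The limit point depends on the initial condition, since every point on the ellipse $|\alpha|^2 + 2|\beta|^2 = \gamma/2$ is a global minimum; only the value of `s` is determined.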


Problem III.19 (Reduced-constellation criterion) Let $u$ be a zero-mean row vector and $w$ an
unknown column vector that we wish to determine so as to minimize the cost function


$$J(w) = \mathsf{E} \left[ \tfrac{1}{2} |uw|^2 - \gamma |uw|_1 \right]$$


for a given constant number $\gamma$ and in terms of the $l_1$-norm of $uw$, as defined in Prob. III.14. Let
$R_u = \mathsf{E}\, u^* u$.


(a) Show that a steepest-descent algorithm for the minimization of $J(w)$ is given by

$$w_i = w_{i-1} - \mu \left[ R_u w_{i-1} - \gamma\, \mathsf{E} \left( u^*\, \mathrm{csgn}(u w_{i-1}) \right) \right]$$

for some step-size $\mu > 0$.


(b) Show that minimizing $J'(w) = \mathsf{E}\, |uw - \gamma\, \mathrm{csgn}(uw)|^2$ is equivalent to minimizing the
cost function above. Conclude that $J(w)$ is attempting to minimize the distance between $uw$
and the four points $\{\pm\gamma \pm j\gamma\}$.


Remark. Compare the above conclusion with the constant-modulus criterion of Prob. III.18, which attempts to
minimize the distance between $uw$ and all points on the circle of radius $\sqrt{\gamma}$. Both criteria are useful for blind
channel equalization.


Problem III.20 (Multi-modulus criterion) Let $u$ be a zero-mean row vector and $w$ an unknown
column vector that we wish to determine so as to minimize the cost function


$$J(w) = \mathsf{E} \left\{ \left[ (\mathrm{Re}(uw))^2 - \gamma \right]^2 + \left[ (\mathrm{Im}(uw))^2 - \gamma \right]^2 \right\}$$


for a given constant number $\gamma$, and where $\mathrm{Re}(\cdot)$ and $\mathrm{Im}(\cdot)$ denote the real and imaginary parts of
their arguments. Let $R_u = \mathsf{E}\, u^* u$. Show that a steepest-descent algorithm for the minimization of
$J(w)$ is given by


$$a = \mathrm{Re}(u w_{i-1}), \quad b = \mathrm{Im}(u w_{i-1}), \quad e(i) = a[\gamma - a^2] + j b[\gamma - b^2], \quad w_i = w_{i-1} + \mu\, \mathsf{E}\, u^* e(i)$$


for some step-size $\mu > 0$. Remark. By minimizing the cost function $J(w)$ we are forcing the real and
imaginary parts of $uw$ to be close to the values $\pm\sqrt{\gamma}$. In other words, $J(w)$ attempts to minimize the distance
between $uw$ and the horizontal and vertical lines located at $\pm\sqrt{\gamma}$.


Problem III.21 (Another constant-modulus criterion) Let $u$ be a zero-mean row vector and
$w$ an unknown column vector that we wish to determine so as to minimize the cost function


$$J(w) = \mathsf{E} \left[ \tfrac{1}{2} |uw|^2 - \gamma |uw| \right]$$


for a given constant number $\gamma$ and in terms of the magnitude of $uw$. Let $R_u = \mathsf{E}\, u^* u$.


(a) Show that a steepest-descent algorithm for the minimization of $J(w)$ is given by


$$w_i = w_{i-1} - \mu \left[ R_u w_{i-1} - \gamma\, \mathsf{E} \left( \frac{u^* u w_{i-1}}{|u w_{i-1}|} \right) \right]$$


for some step-size $\mu > 0$.


(b) Show that minimizing $J'(w) = \mathsf{E} \left( \gamma - |uw| \right)^2$ is equivalent to minimizing the cost function
above. Conclude that $J(w)$ is attempting to minimize the distance between $uw$ and the
circle of radius $\gamma$.


Problem III.22 (Constrained mean-square error) Consider the constrained optimization problem
described in Prob. II.36, namely,


$$\min_w \mathsf{E}\, |d - uw|^2 \quad \text{subject to} \quad c^* w = \alpha$$


where $c$ is a known $M \times 1$ vector and $\alpha$ is a known real scalar. Verify that a steepest-descent recursion
for estimating the optimal solution $w^o$ is given by


$$w_i = w_{i-1} + \mu \left[ I - \frac{c c^*}{\|c\|^2} \right] (R_{du} - R_u w_{i-1})$$


where $R_u = \mathsf{E}\, u^* u > 0$ and $R_{du} = \mathsf{E}\, d u^*$. Hint: Use the extended cost function $J_e(w, \lambda)$ of Prob. II.36
and enforce the condition that the successive weight vectors $\{w_i\}$, including the initial condition, must satisfy
the constraint $c^* w_i = \alpha$.
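The projected recursion of Prob. III.22 can be illustrated for real data as follows. The matrices, the constraint vector `c = [1, 1]` with $\alpha = 1$, the initial condition, and the step-size are assumed values chosen so that the constrained minimizer can be found by hand.

```python
def constrained_steepest_descent(R_u, R_du, c, w_init, mu, iterations):
    """Iterate w_i = w_{i-1} + mu * [I - c c^T / ||c||^2] (R_du - R_u w_{i-1})."""
    size = len(c)
    c_norm2 = sum(ck * ck for ck in c)
    w = list(w_init)
    for _ in range(iterations):
        g = [R_du[k] - sum(R_u[k][m] * w[m] for m in range(size))
             for k in range(size)]
        c_dot_g = sum(ck * gk for ck, gk in zip(c, g))
        # remove the component of g along c so that c^T w_i stays unchanged
        w = [w[k] + mu * (g[k] - c[k] * c_dot_g / c_norm2) for k in range(size)]
    return w

w = constrained_steepest_descent(
    R_u=[[1.0, 0.0], [0.0, 2.0]], R_du=[1.0, 1.0],
    c=[1.0, 1.0], w_init=[1.0, 0.0], mu=0.2, iterations=200)
```

Because the initial condition satisfies the constraint and each correction is orthogonal to `c`, every iterate satisfies it as well; for these values the iterates approach `[2/3, 1/3]`.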


Problem III.23 (Homogeneous difference equation) Consider the homogeneous difference
equation $x_i = A x_{i-1}$ with arbitrary initial vector $x_{-1}$. Show that $x_i$ tends to the zero vector if, and
only if, all the eigenvalues of $A$ have strictly less than unit magnitude. Hint: One possibility is to use the
Jordan decomposition of the matrix $A$; see Prob. II.2.


Problem III.24 (Power estimate) Refer to (11.9) and assume that the entries $\{u(j)\}$ arise from
observations of a white random sequence $\{u(j)\}$ with variance $\sigma_u^2$. In this way, the quantity $p(i)$
can be interpreted as a realization of a random variable $p(i)$ satisfying $p(i) = \beta p(i-1) + (1 -
\beta) |u(i)|^2$, $p(-1) = 0$. Show that $\mathsf{E}\, p(i) \to \sigma_u^2$ as $i \to \infty$.


Problem 1i1.25 (Perturbation property of &-NLMS) Follow the derivation in Sec. 10.4 and show 
that the solution to the optimization problem 


A s ag ad2 ; A = idu 
min [wi — wi-ill  subjectto r(i) = (1 c4 uj? e(t) 


is given by the e—NLMS recursion (11.4). 

Problem III.26 (Leaky-LMS) The leaky-LMS algorithm is the stochastic gradient version of the
steepest-descent method of Prob. III.12. Replace $R_{du}$ by $d(i)u_i^*$, $R_u$ by $u_i^*u_i$, and verify that this
leads to the following recursion:

$$w_i^{\alpha} = (1 - \mu\alpha)\,w_{i-1}^{\alpha} + \mu u_i^*[d(i) - u_iw_{i-1}^{\alpha}], \quad i \ge 0$$

Define the a posteriori and a priori output errors $r^{\alpha}(i) = d(i) - u_iw_i^{\alpha}$ and $e^{\alpha}(i) = d(i) - u_iw_{i-1}^{\alpha}$.
Verify that $r^{\alpha}(i) = [1 - \alpha\mu - \mu\|u_i\|^2]\,e^{\alpha}(i) + \alpha\mu\,d(i)$.
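The error relation above can be checked numerically. The following is a minimal sketch for real-valued data (so that $u_i^*$ reduces to a transpose); the function name `leaky_lms_step`, the random data model, and the values of `mu` and `alpha` are assumptions made for illustration only.

```python
import random

def leaky_lms_step(w, u, d, mu, alpha):
    """One real-valued leaky-LMS update; returns (new_w, e, r)."""
    e = d - sum(ui * wi for ui, wi in zip(u, w))            # a priori error
    w_new = [(1 - mu * alpha) * wi + mu * ui * e for ui, wi in zip(u, w)]
    r = d - sum(ui * wi for ui, wi in zip(u, w_new))        # a posteriori error
    return w_new, e, r

random.seed(0)
mu, alpha = 0.05, 0.1            # illustrative values (assumed)
w = [0.3, -0.2, 0.5]
max_gap = 0.0
for _ in range(100):
    u = [random.gauss(0, 1) for _ in w]
    d = random.gauss(0, 1)
    w, e, r = leaky_lms_step(w, u, d, mu, alpha)
    norm2 = sum(ui * ui for ui in u)
    predicted = (1 - alpha * mu - mu * norm2) * e + alpha * mu * d
    max_gap = max(max_gap, abs(r - predicted))
```

Here `max_gap` records the largest discrepancy between the measured a posteriori error and the value predicted by the identity; it should be at the level of floating-point round-off.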


Problem III.27 (Drift problem) Assume the regressors $u_i$ are scalars, say $u(i)$, and given by
$u(i) = 1/\sqrt{i+1}$. Let $\mu = 1$ and $d(i) = u(i)w^o + v(i)$ with $w^o = 0$ and $v(i) = c$ for all $i$.


(a) Verify that the LMS update for the weight estimate $w(i)$ gives

$$w(i) = \frac{c}{i+1}\sum_{j=1}^{i+1}\sqrt{j}\ \ge\ \frac{2c}{3}\sqrt{i+1}$$

and, therefore, $w(i) \to \infty$ as $i \to \infty$ (no matter how small $c$ is).


(b) Consider instead the leaky-LMS update of Prob. III.26 for a generic step-size $\mu$,

$$w^{\alpha}(i) = \left(1 - \mu\alpha - \frac{\mu}{i+1}\right)w^{\alpha}(i-1) + \frac{\mu c}{\sqrt{i+1}}$$

Show that this recursion results in a bounded sequence $\{w^{\alpha}(i)\}$ if $0 < \mu < 2/(\alpha + 1)$.

Remark. This example shows that the weight estimates computed by LMS can grow slowly to large values, even
when the noise is small (see Prob. IV.39 for a more detailed example).
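The drift can also be observed numerically. The sketch below simulates both updates on the data of this problem, assuming $c > 0$ and zero initial conditions; the function names and the values `c = 1e-3` and `alpha = 0.01` are illustrative assumptions.

```python
import math

def lms_drift(c, num_iter):
    """LMS with mu = 1, u(i) = 1/sqrt(i+1), d(i) = c, and w(-1) = 0."""
    w, path = 0.0, []
    for i in range(num_iter):
        u = 1.0 / math.sqrt(i + 1)
        w = w + u * (c - u * w)
        path.append(w)
    return path

def leaky_lms_drift(c, mu, alpha, num_iter):
    """Leaky-LMS on the same data, again starting from zero."""
    w, path = 0.0, []
    for i in range(num_iter):
        u = 1.0 / math.sqrt(i + 1)
        w = (1 - mu * alpha) * w + mu * u * (c - u * w)
        path.append(w)
    return path

c = 1e-3                                   # small constant disturbance (assumed)
plain = lms_drift(c, 100000)
leaky = leaky_lms_drift(c, mu=1.0, alpha=0.01, num_iter=100000)
```

In this run the LMS estimate keeps growing like $\sqrt{i+1}$, while the leaky estimate stays below $c/\alpha$.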


Problem III.28 (Constrained LMS) Refer to the steepest-descent algorithm of Prob. III.22, which
pertains to the constrained optimization problem

$$\min_{w}\ \mathsf{E}\,|d - uw|^2 \quad \text{subject to} \quad c^*w = \alpha$$

where $c$ is a known $M \times 1$ vector and $\alpha$ is a known real scalar. Verify that a stochastic-gradient
algorithm for approximating the optimal solution $w^o$ is given by

$$w_i = w_{i-1} + \mu\left[\mathrm{I} - \frac{cc^*}{\|c\|^2}\right]u_i^*[d(i) - u_iw_{i-1}]$$

in terms of realizations $\{d(i), u_i\}$ for $\{d, u\}$, and starting from an initial condition that satisfies
$c^*w_{-1} = \alpha$.
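The role of the projection factor can be illustrated with a short simulation. This is a minimal real-valued sketch under assumed data: the helper names, the vector `c`, the scalar `a` (standing in for $\alpha$), and the model `w_true` are hypothetical choices, with `w_true` picked to satisfy the constraint.

```python
import random

def project_out(c, g):
    """Apply [I - c c^T / ||c||^2] to the real vector g."""
    scale = sum(ci * gi for ci, gi in zip(c, g)) / sum(ci * ci for ci in c)
    return [gi - scale * ci for ci, gi in zip(c, g)]

def constrained_lms_step(w, u, d, mu, c):
    """One constrained-LMS update for real-valued data."""
    e = d - sum(ui * wi for ui, wi in zip(u, w))
    step = project_out(c, [ui * e for ui in u])
    return [wi + mu * si for wi, si in zip(w, step)]

random.seed(1)
c, a = [1.0, 2.0, -1.0], 3.0             # illustrative constraint c^T w = a (assumed)
norm_c2 = sum(ci * ci for ci in c)
w = [a * ci / norm_c2 for ci in c]       # feasible initial condition
w_true = [1.0, 1.5, 1.0]                 # hypothetical model; 1 + 3 - 1 = 3, so it is feasible
for _ in range(2000):
    u = [random.gauss(0, 1) for _ in c]
    d = sum(ui * wi for ui, wi in zip(u, w_true)) + 0.01 * random.gauss(0, 1)
    w = constrained_lms_step(w, u, d, mu=0.02, c=c)
constraint_value = sum(ci * wi for ci, wi in zip(c, w))
```

Because every correction is orthogonal to $c$, the value of $c^{\mathsf T}w_i$ never moves away from its initial value, which is why the initial condition must already satisfy the constraint.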


Problem III.29 (Sign-error LMS) Refer to the statement of Prob. III.15. Show that the
corresponding stochastic-gradient method is given by the following so-called sign-error LMS algorithm,

$$w_i = w_{i-1} + \mu u_i^*\,\mathrm{csgn}[d(i) - u_iw_{i-1}], \quad i \ge 0$$

where the complex-sign function is as defined in (12.1).


Problem III.30 (LMF algorithm) Refer to Prob. III.16 and assume $L = 2$. Argue that a stochastic
gradient implementation is given by

$$w_i = w_{i-1} + \mu u_i^*e(i)|e(i)|^2, \quad e(i) = d(i) - u_iw_{i-1}, \quad i \ge 0$$


Problem III.31 (LMMN algorithm) Refer to Prob. III.17. Argue that a stochastic-gradient
implementation is given by

$$w_i = w_{i-1} + \mu u_i^*e(i)\left[\delta + (1-\delta)|e(i)|^2\right], \quad e(i) = d(i) - u_iw_{i-1}, \quad i \ge 0$$


Problem III.32 (Least mean-phase (LMP) algorithm) Let $d$ and $u$ be zero-mean random variables,
with $d$ being a scalar and $u$ a $1 \times M$ vector. Introduce the phase-error cost function
$J_p(w) = \mathsf{E}\,|\mathrm{phase}(d) - \mathrm{phase}(uw)|^m = \mathsf{E}\,|\angle d - \angle uw|^m$, where $m = 1, 2$ and $w$ is an unknown weight
vector to be estimated. Consider further the squared-error cost function $J_s(w) = \mathsf{E}\,|d - uw|^2$
and let $J(w) = k_1J_s(w) + k_2J_p(w)$, where $k_1$ and $k_2$ define the contribution of each term to the
overall cost function.


(a) Verify that for $m = 2$, a stochastic gradient implementation is given by

$$w_i = w_{i-1} + \mu_1u_i^*\,(d(i) - u_iw_{i-1}) + \mu_2\,(\angle d(i) - \angle u_iw_{i-1})\,\frac{ju_i^*}{(u_iw_{i-1})^*}$$

where $\mu_1$ and $\mu_2$ are step-size parameters.

(b) Likewise, verify that for $m = 1$, we obtain instead

$$w_i = w_{i-1} + \mu_1u_i^*\,(d(i) - u_iw_{i-1}) + \mu_2\,\mathrm{sign}(\angle d(i) - \angle u_iw_{i-1})\,\frac{ju_i^*}{(u_iw_{i-1})^*}$$


Remark. In some applications, the squared-error is not the primary parameter affecting the performance of the
system. The information bits may be carried over the phase of the transmitted signal, in which case it is useful
to consider cost functions that relate to both the error magnitude and the phase error (see Tarighat and Sayed).
Problem III.33 (Constant-modulus algorithm) The constant-modulus algorithm CMA2-2 is a
stochastic gradient version of the steepest-descent algorithm developed in Prob. III.18.


(a) Replace the term $\gamma R_uw_{i-1} - \mathsf{E}\,[u^*uw_{i-1}|uw_{i-1}|^2]$ by the instantaneous approximation
$\gamma u_i^*u_iw_{i-1} - u_i^*u_iw_{i-1}|u_iw_{i-1}|^2$, and define $z(i) = u_iw_{i-1}$. Verify that this leads to the
recursion

$$w_i = w_{i-1} + \mu u_i^*z(i)\left[\gamma - |z(i)|^2\right], \quad z(i) = u_iw_{i-1}, \quad i \ge 0$$

Remark. This recursion is known as CMA2-2. The numbers 2-2 refer to the fact that the cost function
in this case (cf. Prob. III.18) is of the form $\mathsf{E}\,(\gamma - |uw|^2)^2$, which is a special case of the more general
cost function $\mathsf{E}\,(\gamma - |uw|^p)^q$ for the values $p = 2$ and $q = 2$.


(b) Can you guarantee, as in the steepest-descent method of Prob. III.18, that the estimates $w_i$ in
CMA2-2 will always be real-valued for any real-valued initial condition $w_{-1}$? Justify your
answer.
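One way to explore part (b) is to apply a single CMA2-2 update to a real-valued weight vector, once with a complex regressor and once with a real regressor. The sketch below does this with Python's built-in complex numbers; the function name `cma22_step` and all numerical values are assumptions chosen for illustration.

```python
def cma22_step(w, u, mu, gamma):
    """One CMA2-2 update: w + mu * u^* * z * (gamma - |z|^2), with z = u w."""
    z = sum(ui * wi for ui, wi in zip(u, w))
    g = z * (gamma - abs(z) ** 2)
    return [wi + mu * ui.conjugate() * g for ui, wi in zip(u, w)]

w_start = [1.0, 0.5]                                   # real-valued initial condition (assumed)
w_complex_data = cma22_step(w_start, [1 + 1j, 2 - 1j], mu=0.01, gamma=1.0)
w_real_data = cma22_step(w_start, [1.0, 2.0], mu=0.01, gamma=1.0)
```

Comparing the imaginary parts of the two results indicates whether realness of the weights is preserved by the data-driven recursion.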


Problem III.34 (Reduced-constellation algorithm) The reduced-constellation algorithm (RCA)
is a stochastic gradient version of the steepest-descent algorithm developed in Prob. III.19. Replace
the expectations in the recursion of part (a) of Prob. III.19 by instantaneous approximations and
verify that this substitution leads to the following recursion. Let $z(i) = u_iw_{i-1}$. Then

$$w_i = w_{i-1} + \mu u_i^*\left(\gamma\,\mathrm{csgn}(z(i)) - z(i)\right)$$

for some step-size $\mu > 0$.


Remark. Recall from the discussion in Prob. III.19 that RCA attempts to minimize the distance between the
output of the filter and four points in the complex plane, namely, $\gamma(\pm 1 \pm j)$. In other words, for multi-level
constellations, RCA attempts to minimize the mean-square error between the output of the filter and a reduced
number of symbols, which may not belong to the original signal constellation. Compared with CMA, the RCA
method is simple to implement but tends to face convergence difficulties (see, e.g., Werner et al. (1999)).


Problem III.35 (Multi-modulus algorithm) The multi-modulus algorithm (MMA) is a stochastic-gradient
version of the steepest-descent algorithm developed in Prob. III.20. Replace the expectations
in the recursion of Prob. III.20 by instantaneous approximations and verify that this substitution leads
to the following algorithm:

$$z(i) = u_iw_{i-1}$$
$$a(i) = \mathrm{Re}[z(i)]$$
$$b(i) = \mathrm{Im}[z(i)]$$
$$e(i) = a(i)\left[\gamma - a^2(i)\right] + j\,b(i)\left[\gamma - b^2(i)\right]$$
$$w_i = w_{i-1} + \mu u_i^*e(i)$$

for some step-size $\mu > 0$. How is this recursion different from that of CMA2-2 from Prob. III.33?


Remark. The MMA scheme was proposed by Yang, Werner, and Dumont (1997). Recall from Prob. III.20 that
MMA attempts to minimize the distance between the output of the filter and the horizontal and vertical lines
located at $\pm\sqrt{\gamma}$. Specifically, MMA attempts to minimize the dispersion between the real and imaginary parts of
the filter output around the value of $\gamma$.


Problem III.36 (Another constant-modulus algorithm) The constant modulus algorithm CMA1-2
is a stochastic-gradient version of the steepest-descent algorithm developed in Prob. III.21.


(a) Replace the expectations in the recursion of part (a) of Prob. III.21 by instantaneous
approximations and verify that this substitution leads to the following recursion. Let $z(i) = u_iw_{i-1}$.
Then

$$w_i = w_{i-1} + \mu u_i^*\left[\gamma\frac{z(i)}{|z(i)|} - z(i)\right]$$

for some step-size $\mu > 0$ (when $z(i) = 0$ we set $w_i = w_{i-1}$).


Remark. This recursion is known as CMA1-2 because the cost function in this case (cf. Prob. III.21) is of
the form $\mathsf{E}\,(\gamma - |uw|)^2$, which is a special case of the more general cost function $\mathsf{E}\,(\gamma - |uw|^p)^q$ for
the values $p = 1$ and $q = 2$.


(b) Let $w_i = w_{i-1} + \delta w$. Given $w_{i-1}$, show that the weight vector $w_i$ with the smallest
perturbation $\delta w$ that solves the optimization problem

$$\min_{w_i}\ |u_iw_i - u_iw_{i-1}|^2 \quad \text{subject to} \quad |u_iw_i| = \gamma$$

is given by

$$w_i = w_{i-1} + \frac{u_i^*}{\|u_i\|^2}\left[\gamma\frac{z(i)}{|z(i)|} - z(i)\right]$$


Remark. Inserting a step-size $\mu$ into the above recursion leads to the normalized CMA algorithm,

$$w_i = w_{i-1} + \mu\frac{u_i^*}{\|u_i\|^2}\left[\gamma\frac{z(i)}{|z(i)|} - z(i)\right]$$

Problem III.37 (Gauss-Newton algorithm) A stochastic-gradient method that is similar in nature
to the RLS algorithm of Sec. 14.1 can be obtained in the following manner.


(a) Let $\epsilon(i) = \lambda^{i+1}\epsilon/(i+1)$ and $\mu(i) = \mu$. Use the same approximations as in the RLS case for
$R_u$ and $(R_{du} - R_uw_{i-1})$ to verify that the regularized Newton's recursion (14.1) reduces to

$$w_i = w_{i-1} + \mu\,\Phi_i^{-1}u_i^*[d(i) - u_iw_{i-1}]$$

where

$$\Phi_i = \frac{1}{i+1}\left(\lambda^{i+1}\epsilon\,\mathrm{I} + \sum_{j=0}^{i}\lambda^{i-j}u_j^*u_j\right), \quad i \ge 0$$


(b) Show that $\Phi_i$ satisfies the recursion $\Phi_i = \lambda[1 - \alpha(i)]\Phi_{i-1} + \alpha(i)u_i^*u_i$ for $i \ge 1$, with initial
condition $\Phi_0 = \lambda\epsilon\,\mathrm{I} + u_0^*u_0$, and where $\alpha(i) = 1/(i+1)$.


(c) Define $P_i = \Phi_i^{-1}$. Show that

$$P_i = \frac{\lambda^{-1}}{1 - \alpha(i)}\left[P_{i-1} - \frac{\lambda^{-1}P_{i-1}u_i^*u_iP_{i-1}}{\dfrac{1 - \alpha(i)}{\alpha(i)} + \lambda^{-1}u_iP_{i-1}u_i^*}\right], \quad i \ge 1$$

$$w_i = w_{i-1} + \mu P_iu_i^*[d(i) - u_iw_{i-1}], \quad i \ge 0$$

with initial condition

$$P_0 = \frac{1}{\lambda\epsilon}\left[\mathrm{I} - \frac{u_0^*u_0}{\lambda\epsilon + \|u_0\|^2}\right]$$


(d) Repeat the calculations in Tables 5.7 and 5.8 and estimate the amount of computations that
are required per iteration for both cases of real and complex-valued data.


Problem III.38 (Simplified GN algorithm) Consider the same setting as in Prob. III.37. An
alternative form of the GN algorithm can be obtained by replacing $\alpha(i)$ by a constant positive number
$\alpha$ (usually small, say $0 < \alpha < 0.1$).


(a) Verify that the recursions of part (c) of Prob. III.37 reduce to

$$P_i = \frac{\lambda^{-1}}{1 - \alpha}\left[P_{i-1} - \frac{\lambda^{-1}P_{i-1}u_i^*u_iP_{i-1}}{\dfrac{1 - \alpha}{\alpha} + \lambda^{-1}u_iP_{i-1}u_i^*}\right], \quad P_{-1} = \epsilon^{-1}\mathrm{I}$$

$$w_i = w_{i-1} + \mu P_iu_i^*[d(i) - u_iw_{i-1}], \quad i \ge 0$$


(b) Show further that the above recursions could have been derived directly from Newton's recursion
(14.1) by using the following sample average approximation for $R_u$,
$\hat{R}_u = \alpha\sum_{j=0}^{i}[\lambda(1-\alpha)]^{i-j}u_j^*u_j$, along with the choice $\epsilon(i) = \lambda^{i+1}(1-\alpha)^{i+1}\epsilon$.


Problem III.39 (Sample covariance matrices) Consider the sample covariance matrix
$\hat{R}_u = \frac{1}{i+1}\sum_{j=0}^{i}\lambda^{i-j}u_j^*u_j$ used in Sec. 14.1 (with exponential weighting). Let us denote $\hat{R}_u$ by $\hat{R}_{u,i}$ to indicate
that it is based on data up to time $i$.


(a) Verify that $\hat{R}_{u,i}$ satisfies the recursion $\hat{R}_{u,i} = \frac{\lambda i}{i+1}\hat{R}_{u,i-1} + \frac{1}{i+1}u_i^*u_i$.


(b) Consider instead the sample covariance matrix $\hat{R}_{u,i} = \alpha\sum_{j=0}^{i}(1-\alpha)^{i-j}u_j^*u_j$ used in
Prob. III.38. Verify now that $\hat{R}_{u,i} = (1-\alpha)\hat{R}_{u,i-1} + \alpha u_i^*u_i$.


Problem III.40 (Conversion factor) Show that the RLS algorithm can be written in the equivalent
form

$$w_i = w_{i-1} + \frac{\lambda^{-1}P_{i-1}u_i^*}{1 + \lambda^{-1}u_iP_{i-1}u_i^*}\,[d(i) - u_iw_{i-1}], \quad i \ge 0.$$

Show further that $r(i) = \gamma(i)e(i)$ where $\gamma(i) = 1/(1 + \lambda^{-1}u_iP_{i-1}u_i^*)$ and $r(i)$ and $e(i)$ denote the
a posteriori and a priori output errors, $r(i) = d(i) - u_iw_i$ and $e(i) = d(i) - u_iw_{i-1}$. Conclude that,
for all $i$, $|r(i)| \le |e(i)|$. Remark. The coefficient $\gamma(i) = 1/(1 + \lambda^{-1}u_iP_{i-1}u_i^*)$ is called the conversion
factor since it converts the a priori error to the a posteriori error. We shall have more to say about the RLS
algorithm, its properties, and its variants in Parts VII (Least-Squares Methods) through X (Lattice Filters).
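The conversion-factor relation can be checked on simulated data. The following is a minimal real-valued sketch with $M = 2$; the function name `rls_step`, the forgetting factor, the initial matrix $P_{-1}$, and the data model `w_true` are illustrative assumptions rather than values taken from the text.

```python
import random

def rls_step(w, P, u, d, lam):
    """One real-valued RLS update; returns (w_new, P_new, e, r, gamma)."""
    M = len(w)
    Pu = [sum(P[a][b] * u[b] for b in range(M)) for a in range(M)]   # P_{i-1} u_i^T
    uPu = sum(u[a] * Pu[a] for a in range(M))
    gamma = 1.0 / (1.0 + uPu / lam)                                  # conversion factor
    e = d - sum(u[a] * w[a] for a in range(M))                       # a priori error
    gain = [gamma * Pu[a] / lam for a in range(M)]
    w_new = [w[a] + gain[a] * e for a in range(M)]
    # P is symmetric, so u_i P_{i-1} equals the transpose of Pu.
    P_new = [[(P[a][b] - gain[a] * Pu[b]) / lam for b in range(M)] for a in range(M)]
    r = d - sum(u[a] * w_new[a] for a in range(M))                   # a posteriori error
    return w_new, P_new, e, r, gamma

random.seed(3)
lam = 0.98                                         # forgetting factor (assumed)
w, P = [0.0, 0.0], [[100.0, 0.0], [0.0, 100.0]]    # P_{-1} = (1/eps) I with eps = 0.01 (assumed)
w_true = [0.7, -0.4]                               # hypothetical model
max_gap, shrinks = 0.0, True
for _ in range(50):
    u = [random.gauss(0, 1), random.gauss(0, 1)]
    d = u[0] * w_true[0] + u[1] * w_true[1] + 0.1 * random.gauss(0, 1)
    w, P, e, r, gamma = rls_step(w, P, u, d, lam)
    max_gap = max(max_gap, abs(r - gamma * e))
    shrinks = shrinks and abs(r) <= abs(e) + 1e-12
```

The flag `shrinks` stays true as long as the a posteriori error never exceeds the a priori error in magnitude, which is what the conversion factor predicts when $P_{i-1}$ is positive-definite.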


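Problem III.41 (Bound on the step-size for $\epsilon$-APA) Refer to the discussion in Sec. 13.3 and
to the definitions of the error vectors $\{e_i, r_i\}$.


(a) Use the constraint in (13.8) to show that $\|r_i\|^2 \le \|e_i\|^2$ if, and only if, the matrix $\mathrm{I} - A$ is
positive-definite, where

$$A \triangleq \left(\mathrm{I} - \mu U_iU_i^*(\epsilon\mathrm{I} + U_iU_i^*)^{-1}\right)^*\left(\mathrm{I} - \mu U_iU_i^*(\epsilon\mathrm{I} + U_iU_i^*)^{-1}\right)$$

Show further that $\|r_i\|^2 = \|e_i\|^2$ if, and only if, $e_i = 0$.


(b) Let $U_iU_i^* = V_i\Gamma_iV_i^*$ denote the eigen-decomposition of the $K \times K$ matrix $U_iU_i^*$, where $V_i$ is
unitary and $\Gamma_i = \mathrm{diag}\{\gamma_0(i), \gamma_1(i), \ldots, \gamma_{K-1}(i)\}$ contains the corresponding eigenvalues.
Show that $\mathrm{I} - A = \mu V_i\Gamma_i'(2\mathrm{I} - \mu\Gamma_i')V_i^*$, where $\Gamma_i' = \Gamma_i(\epsilon\mathrm{I} + \Gamma_i)^{-1}$.


(c) Conclude that $\mathrm{I} - A > 0$ if, and only if,

$$0 < \mu < \min_{0 \le k \le K-1}\ 2\left(1 + \frac{\epsilon}{\gamma_k(i)}\right)$$

Conclude that $0 < \mu < 2$ guarantees $\|r_i\|^2 \le \|e_i\|^2$.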


Problem III.42 (Least-perturbation property of $\epsilon$-APA) Refer to the optimization problem (13.8).

(a) Introduce the difference $\delta w = w_i - w_{i-1}$. Show that the constraint amounts to the requirement
$U_i\delta w = \mu U_iU_i^*(\epsilon\mathrm{I} + U_iU_i^*)^{-1}e_i$.

(b) Verify that the choice $\delta w^o = \mu U_i^*(\epsilon\mathrm{I} + U_iU_i^*)^{-1}e_i$ satisfies the constraint.


(c) Now complete the argument, as in Sec. 10.4, to show that the solution of (13.8) leads to the
$\epsilon$-APA recursion (13.5).


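Problem III.43 (Gram-Schmidt orthogonalization) Consider three row vectors $\{u_1, u_2, u_3\}$
and define the transformed vectors (also called residuals):

$$\tilde{u}_1 = u_1, \qquad \tilde{u}_2 = u_2 - \frac{u_2\tilde{u}_1^*}{\|\tilde{u}_1\|^2}\tilde{u}_1, \qquad \tilde{u}_3 = u_3 - \frac{u_3\tilde{u}_1^*}{\|\tilde{u}_1\|^2}\tilde{u}_1 - \frac{u_3\tilde{u}_2^*}{\|\tilde{u}_2\|^2}\tilde{u}_2$$

(a) Verify that the residual vectors so obtained are orthogonal to each other, i.e., show that
$\tilde{u}_i\tilde{u}_j^* = 0$ for $i \ne j$.


(b) Verify further that $\{u_1, u_2, u_3\}$ and $\{\tilde{u}_1, \tilde{u}_2, \tilde{u}_3\}$ are related via an invertible lower triangular
matrix:

$$\begin{bmatrix}\tilde{u}_1\\ \tilde{u}_2\\ \tilde{u}_3\end{bmatrix} = \begin{bmatrix}1 & & \\ -\dfrac{u_2\tilde{u}_1^*}{\|\tilde{u}_1\|^2} & 1 & \\ \dfrac{u_3\tilde{u}_2^*}{\|\tilde{u}_2\|^2}\dfrac{u_2\tilde{u}_1^*}{\|\tilde{u}_1\|^2} - \dfrac{u_3\tilde{u}_1^*}{\|\tilde{u}_1\|^2} & -\dfrac{u_3\tilde{u}_2^*}{\|\tilde{u}_2\|^2} & 1\end{bmatrix}\begin{bmatrix}u_1\\ u_2\\ u_3\end{bmatrix}$$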

Problem III.44 (APA with orthogonal correction factors) Consider an APA update of order
$K = 3$, i.e., $e_i = d_i - U_iw_{i-1}$ and $w_i = w_{i-1} + \mu U_i^*(U_iU_i^*)^{-1}e_i$, where

$$U_i = \begin{bmatrix}u_i\\ u_{i-1}\\ u_{i-2}\end{bmatrix}, \qquad d_i = \begin{bmatrix}d(i)\\ d(i-1)\\ d(i-2)\end{bmatrix}$$

In this problem we want to show that APA can be implemented in an equivalent form that involves
only orthogonal regression vectors. So assume that, at each iteration $i$, the regressors $\{u_i, u_{i-1}, u_{i-2}\}$
are first orthogonalized, as described in Prob. III.43, and let $\{\tilde{u}_i, \tilde{u}_{i-1}, \tilde{u}_{i-2}\}$ denote the corresponding
residual vectors (with $\tilde{u}_i = u_i$). Let $L_i$ denote the lower-triangular matrix relating the residuals
to the regressors, i.e.,

$$\tilde{U}_i = \begin{bmatrix}\tilde{u}_i\\ \tilde{u}_{i-1}\\ \tilde{u}_{i-2}\end{bmatrix} = L_iU_i$$

(a) Verify that the APA update can be written as $w_i = w_{i-1} + \mu\tilde{U}_i^*(\tilde{U}_i\tilde{U}_i^*)^{-1}L_ie_i$, where
$\tilde{U}_i\tilde{U}_i^*$ is diagonal and given by $\tilde{U}_i\tilde{U}_i^* = \mathrm{diag}\{\|\tilde{u}_i\|^2, \|\tilde{u}_{i-1}\|^2, \|\tilde{u}_{i-2}\|^2\}$.

(b) Show that the entries of $L_ie_i$ are given by $L_ie_i = \mathrm{col}\{e(i), e^{(1)}(i-1), e^{(2)}(i-2)\}$, where

$$e(i) = d(i) - u_iw_{i-1}, \quad e^{(1)}(i-1) = d(i-1) - u_{i-1}w_{i-1}^{(1)}, \quad e^{(2)}(i-2) = d(i-2) - u_{i-2}w_{i-1}^{(2)}$$

and $\{w_{i-1}^{(1)}, w_{i-1}^{(2)}\}$ are intermediate corrections to $w_{i-1}$ defined by

$$w_{i-1}^{(1)} = w_{i-1} + \frac{\tilde{u}_i^*}{\|\tilde{u}_i\|^2}e(i), \qquad w_{i-1}^{(2)} = w_{i-1}^{(1)} + \frac{\tilde{u}_{i-1}^*}{\|\tilde{u}_{i-1}\|^2}e^{(1)}(i-1)$$

(c) Conclude that a general $K$-th order APA update can be equivalently implemented as follows.
For each time instant $i$, start with $w_{i-1}^{(0)} = w_{i-1}$ and repeat for $k = 0, 1, \ldots, K-1$:

$$e^{(k)}(i-k) = d(i-k) - u_{i-k}w_{i-1}^{(k)}$$
$$w_{i-1}^{(k+1)} = w_{i-1}^{(k)} + \frac{\tilde{u}_{i-k}^*}{\|\tilde{u}_{i-k}\|^2}e^{(k)}(i-k)$$

Then set

$$w_i = w_{i-1} + \mu\sum_{k=0}^{K-1}\frac{\tilde{u}_{i-k}^*}{\|\tilde{u}_{i-k}\|^2}e^{(k)}(i-k)$$

Remark. This form of the algorithm is sometimes referred to as APA or NLMS with orthogonal correction
factors (APA-OCF or NLMS-OCF) since the regressors $\{\tilde{u}_i\}$ that are used in the update of the weight
vector are orthogonal to each other.


(d) Verify further from part (c) that $w_i$ is given by the convex combination

$$w_i = (1 - \mu)w_{i-1} + \mu w_{i-1}^{(K)}$$


Problem III.45 (LMS as a notch filter) Consider an LMS adaptive filter with a regression vector
$u_i$ with shift-structure, i.e., its entries are delayed replicas of an input sequence $\{u(i)\}$, as in an FIR
implementation,

$$u_i = \begin{bmatrix}u(i) & u(i-1) & \ldots & u(i-M+1)\end{bmatrix}$$

Assume $u(i)$ is sinusoidal, say $u(i) = e^{j\omega_0i}$ for some $\omega_0$. Assume also that the filter is initially
at rest. Let $w_i$ denote the coefficients of the LMS filter, which are adapted according to the rule
$w_i = w_{i-1} + \mu u_i^*[d(i) - u_iw_{i-1}]$. Although the coefficients of the filter vary with time, due to the
adaptation process, it turns out that the input-output mapping is actually time-invariant in this case.


(a) Show that the transfer function from the desired signal $d(i)$ to the error signal $e(i) = d(i) - u_iw_{i-1}$
is given by

$$\frac{E(z)}{D(z)} = \frac{z - e^{j\omega_0}}{z + (\mu M - 1)e^{j\omega_0}}$$

(b) Assume the adaptive filter is trained instead by $\epsilon$-NLMS. Show that the same transfer function
becomes

$$\frac{E(z)}{D(z)} = \frac{z - e^{j\omega_0}}{z + \left(\dfrac{\mu M}{\epsilon + M} - 1\right)e^{j\omega_0}}$$

What does this result reduce to when $M \to \infty$?


(c) Assume the filter is trained using the power normalized $\epsilon$-NLMS algorithm (cf. Alg. 11.2).
Show that in the limit, as $i \to \infty$, the transfer function from $d(i)$ to $e(i)$ is again given by

$$\frac{E(z)}{D(z)} = \frac{z - e^{j\omega_0}}{z + \left(\dfrac{\mu M}{\epsilon + 1} - 1\right)e^{j\omega_0}}$$

Remark. The results indicate that, with a sinusoidal excitation, the LMS filter behaves like a notch filter with 
a notch frequency at wo. Applications to adaptive noise cancelling of sinusoidal interferences can be found in 
Widrow et al. (1976) and Glover (1977). 
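The time-invariant behavior in part (a) can be checked in the time domain: the stated transfer function corresponds to the difference equation $e(i) = (1 - \mu M)e^{j\omega_0}e(i-1) + d(i) - e^{j\omega_0}d(i-1)$. The sketch below runs LMS and measures the mismatch with this equation. It assumes that the sinusoid $u(i) = e^{j\omega_0i}$ is defined for all integer $i$ (so every regressor has squared norm $M$) and that $w_{-1} = 0$; the function name and the parameter values are illustrative.

```python
import cmath
import random

def lms_sinusoid_errors(d, M, mu, w0):
    """Run LMS with u(i) = exp(j*w0*i) for all integers i (assumption) and w_{-1} = 0."""
    w = [0j] * M
    errors = []
    for i, di in enumerate(d):
        u = [cmath.exp(1j * w0 * (i - k)) for k in range(M)]     # shift-structured regressor
        e = di - sum(uk * wk for uk, wk in zip(u, w))            # a priori error
        w = [wk + mu * uk.conjugate() * e for uk, wk in zip(u, w)]
        errors.append(e)
    return errors

random.seed(2)
M, mu, w0 = 4, 0.05, 0.7                                         # illustrative values (assumed)
d = [complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(200)]
e = lms_sinusoid_errors(d, M, mu, w0)
rot = cmath.exp(1j * w0)
max_gap = max(abs(e[i] - ((1 - mu * M) * rot * e[i - 1] + d[i] - rot * d[i - 1]))
              for i in range(1, len(d)))
```

A value of `max_gap` at round-off level indicates that the error sequence obeys a fixed first-order difference equation even though the weights themselves change with time.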


Problem III.46 (Adaptive line enhancement) Let $d(i) = s(i) + v(i)$ denote a zero-mean random
sequence that consists of two components: a signal component, $s(i)$, and a noise component,
$v(i)$. Let $r_s(k)$ and $r_v(k)$ denote the auto-correlation sequences of $\{s(i), v(i)\}$, assumed stationary,
i.e., $r_s(k) = \mathsf{E}\,s(i)s^*(i-k)$ and $r_v(k) = \mathsf{E}\,v(i)v^*(i-k)$.


FIGURE III.1 An adaptive structure for line signal enhancement.


Assume that $r_s(k)$ is negligible for $k > \delta_s$, while $r_v(k)$ is negligible for $k > \delta_n$ with $\delta_s \gg \delta_n$.
We say that $s(i)$ corresponds to the narrowband component of $d(i)$ while $v(i)$ corresponds to the
wideband component of $d(i)$. The adaptive structure of Fig. III.1 is suggested for use in separating
the signal $s(i)$ from the noise $v(i)$. As the figure indicates, realizations $d(i)$ are used as the reference
sequence, and a delayed replica of these same realizations is used as input to the tapped delay line.
The value of the delay $\Delta$ is chosen to satisfy $\delta_n < \Delta < \delta_s$, and the filter taps are trained using LMS,
for example. It is claimed that the output of the filter, $u_iw_{i-1}$, provides estimates of the signal
component, while the error signal, $e(i)$, provides estimates of the noise component. That is, $\hat{v}(i) = e(i)$
and $\hat{s}(i) = u_iw_{i-1}$. Justify the validity of this claim. Hint: Refer to Prob. II.13.


Remark. The adaptive structure described in this problem is known as an adaptive line enhancer (ALE); it permits 
separating a narrowband signal (e.g., a sinusoid) from a wideband noise signal. The ALE was originally developed 
by Widrow et al. (1975). Its performance was later studied in some detail by Zeidler et al. (1978), Rickard and 
Zeidler (1979), Treichler (1979), and Zeidler (1990). 


COMPUTER PROJECTS 


Project III.1 (Constant-modulus criterion) Refer to Prob. III.18, where we introduced the cost
function $J(w) = \mathsf{E}\,(\gamma - |uw|^2)^2$, for a given positive constant $\gamma$. This cost arises in the context
of blind equalization where it is used to derive blind adaptive filters (see, e.g., Prob. III.33). In
this project, we use $J(w)$ to highlight some of the issues that arise in the design of steepest-descent
methods.


(a) Assume first that $w$ is one-dimensional and $u$ is scalar-valued with variance $\sigma_u^2 = \mathsf{E}\,|u|^2$.
Assume further that $u$ and $w$ are real-valued and that $u$ is Gaussian so that its fourth moment
is given by $\mathsf{E}\,u^4 = 3\sigma_u^4$. Verify that, under these conditions, the cost function $J(w)$ evaluates
to $J(w) = \gamma^2 - 2\gamma\sigma_u^2w^2 + 3\sigma_u^4w^4$. Conclude that $J(w)$ has a local maximum at $w = 0$
and two global minima at $w^o = \pm\sqrt{\gamma/3\sigma_u^2}$. Plot $J(w)$ and determine the values of $w^o$ using
$\gamma = 1$ and $\sigma_u^2 = 0.5$. Find also the corresponding minimum cost.


(b) Argue that a steepest-descent method for minimizing $J(w)$ can be taken as

$$w(i) = w(i-1) + 4\mu\sigma_u^2\,w(i-1)\left[\gamma - 3\sigma_u^2w^2(i-1)\right], \quad w(-1) = \text{initial guess}$$

with scalar estimates $\{w(i)\}$. Does this method converge to a global minimum if the initial
weight guess is chosen as $w(-1) = 0$? Why?


(c) Simulate the steepest-descent recursion of part (b) in the following cases and comment on its
behavior in each case:

1. $\mu = 0.2$ and $w(-1) = 0.3$ or $w(-1) = -0.3$.
2. $\mu = 0.6$ and $w(-1) = 0.3$ or $w(-1) = -0.3$.
3. $\mu = 1$ and $w(-1) = -0.2$.

In each case, plot the evolution of $w(i)$ as a function of time. Plot also the graph of $J[w(i)]$ versus
$w(i)$.


(d) Now consider the setting of part (b) in Prob. III.18 where $w$ is two-dimensional.

(d.1) Generate a plot of the contour curves of the cost function for $\gamma = 1$, i.e.,

$$J(w) = \gamma^2 + 2\left(|\alpha|^2 + 2|\beta|^2\right)^2 - 2\gamma\left(|\alpha|^2 + 2|\beta|^2\right)$$

(d.2) Choose $\mu = 0.02$ and apply the resulting algorithm,

$$\begin{bmatrix}\alpha(i)\\ \beta(i)\end{bmatrix} = \begin{bmatrix}\alpha(i-1)\\ \beta(i-1)\end{bmatrix} + \mu\left(\gamma - 2\alpha^2(i-1) - 4\beta^2(i-1)\right)\begin{bmatrix}\alpha(i-1)\\ 2\beta(i-1)\end{bmatrix}$$

starting from the initial conditions $w_{-1} = \mathrm{col}\{1, -0.25\}$ and $w_{-1} = \mathrm{col}\{-1, 0.25\}$.
Iterate over a period of length $N = 1000$. Plot the trajectory of the weight estimates
superimposed on the contour curves of $J(w)$.
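As a starting point for parts (a) through (c), the scalar cost and the recursion of part (b) can be coded in a few lines. This is a minimal sketch without plotting; the function names are assumptions, and only the first case of part (c) and the initial guess $w(-1) = 0$ are run here.

```python
import math

def cm_cost(w, gamma, sigma2):
    """Scalar constant-modulus cost for real Gaussian u with variance sigma2."""
    return gamma ** 2 - 2 * gamma * sigma2 * w ** 2 + 3 * sigma2 ** 2 * w ** 4

def cm_steepest_descent(w_init, mu, gamma, sigma2, num_iter):
    """Iterate w(i) = w(i-1) + 4*mu*sigma2*w(i-1)*(gamma - 3*sigma2*w(i-1)**2)."""
    w, path = w_init, [w_init]
    for _ in range(num_iter):
        w = w + 4 * mu * sigma2 * w * (gamma - 3 * sigma2 * w ** 2)
        path.append(w)
    return path

gamma, sigma2 = 1.0, 0.5
w_opt = math.sqrt(gamma / (3 * sigma2))            # positive global minimizer
path = cm_steepest_descent(0.3, 0.2, gamma, sigma2, 200)
stuck = cm_steepest_descent(0.0, 0.2, gamma, sigma2, 200)
```

The remaining cases of part (c) follow by changing `mu` and `w_init`; larger step-sizes need not converge, so it is worth inspecting the whole trajectory rather than only its last entry.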


Adaptive filters are used in many applications and we cannot attempt to cover all of them in a text- 
book. Instead, we shall illustrate the use of adaptive filters in selected applications of heightened 
interest, including channel equalization, channel estimation, and echo cancellation. The computer 
projects in this section will focus on channel equalization. In later parts of the book, the computer 
projects will consider applications involving line echo cancellation, channel tracking, channel esti- 
mation, acoustic echo cancellation, and active noise control. In addition, in some of the problems, 
other applications are considered, such as adaptive line enhancement in Prob. III.46. 


Project III.2 (Constant-modulus algorithm) In Prob. III.18 we introduced the constant-modulus
criterion

$$\min_{w}\ \mathsf{E}\,(\gamma - |uw|^2)^2$$

and developed a steepest-descent method for it, namely

$$w_i = w_{i-1} + \mu\left(\gamma R_uw_{i-1} - \mathsf{E}\,[u^*uw_{i-1}|uw_{i-1}|^2]\right)$$

We reconsidered this method in Prob. III.33 and derived the corresponding stochastic gradient
approximation, known as CMA2-2, namely,

$$w_i = w_{i-1} + \mu u_i^*z(i)\left[\gamma - |z(i)|^2\right], \quad z(i) = u_iw_{i-1}$$

In this project we compare the performance of the steepest-descent method and the CMA2-2 recursion.
For this purpose, we set $\gamma = 1$ and let $u$ be a 2-dimensional circular Gaussian random vector
with covariance matrix $R_u = \mathrm{diag}\{1, 2\}$. We showed in Prob. III.18 that, under these conditions,
the steepest-descent method collapses to the form

$$\begin{bmatrix}\alpha(i)\\ \beta(i)\end{bmatrix} = \begin{bmatrix}\alpha(i-1)\\ \beta(i-1)\end{bmatrix} + \mu\left(\gamma - 2\alpha^2(i-1) - 4\beta^2(i-1)\right)\begin{bmatrix}\alpha(i-1)\\ 2\beta(i-1)\end{bmatrix}$$

where $w_i = \mathrm{col}\{\alpha(i), \beta(i)\}$. In addition, we showed in Prob. III.18 that the corresponding cost
function evaluates to

$$J(w_i) = \gamma^2 + 2\left(|\alpha(i)|^2 + 2|\beta(i)|^2\right)^2 - 2\gamma\left(|\alpha(i)|^2 + 2|\beta(i)|^2\right)$$

(a) Plot the learning curve $J(i) = J(w_{i-1})$ for the steepest-descent method; here $w_{i-1}$ is
the weight estimate that results from the steepest-descent recursion. Plot also an ensemble-average
learning curve $\hat{J}(i)$ for the CMA2-2 algorithm that is generated as follows:

$$\hat{J}(i) = \frac{1}{L}\sum_{j=1}^{L}\left(\gamma - |z^{(j)}(i)|^2\right)^2, \quad i \ge 0$$

with the data $\{z^{(j)}(i)\}$ generated from $L$ experiments, say of duration $N$ each. Choose $L = 200$
and $N = 500$. Use $\mu = 0.001$ and $w_{-1} = \mathrm{col}\{-0.8, 0.8\}$. Remark. We are assuming
complex-valued regressors $u$, with the two entries of $u$ having variances $\{1, 2\}$. In order to generate such
regressors, create four separate real-valued zero-mean and independent Gaussian numbers $\{a, b, p, q\}$
with variances $\{1/2, 1/2, 1, 1\}$, respectively, and then set $u = [\,a + jb \ \ \ p + jq\,]$.

(b) Plot the contour curves of the cost function $J(w) = \mathsf{E}\,(\gamma - |uw|^2)^2$ superimposed on the
four weight trajectories that are generated by CMA2-2 for the following four choices of initial
conditions:

$$w_{-1} \in \left\{\mathrm{col}\{0.8, 0.8\},\ \mathrm{col}\{-0.8, 0.8\},\ \mathrm{col}\{0.8, -0.8\},\ \mathrm{col}\{-0.8, -0.8\}\right\}$$

Use $\mu = 0.001$ and $N = 1000$ iterations in each case. Print the final value of the weight
vector estimate in each case. Show also the trajectories of the weight estimates that are generated
by the steepest-descent method. Remark. Although the weight estimates in the steepest-descent
method are always real-valued for a real-valued initial condition $w_{-1}$, the same is not true for CMA2-2.
However, the imaginary parts of the successive weight estimates will be small compared to the real parts.
For this reason, when plotting the weight trajectories, we shall ignore the imaginary parts.


Project III.3 (Adaptive channel equalization) In Computer Projects II.1 and II.3 we dealt with
the design of minimum mean-square error equalizers. In this project we examine the design of
adaptive equalizers. We consider the same channel of Computer Project II.3, namely,

$$C(z) = 0.5 + 1.2z^{-1} + 1.5z^{-2} - z^{-3}$$

and proceed to design an adaptive linear equalizer for it. The equalizer structure is shown in Fig. III.2.
Symbols $\{s(i)\}$ are transmitted through the channel and corrupted by additive complex-valued white
noise $\{v(i)\}$. The received signal $\{u(i)\}$ is processed by the FIR equalizer to generate estimates
$\{\hat{s}(i - \Delta)\}$, which are fed into a decision device. The equalizer possesses two modes of operation: a
training mode during which a delayed replica of the input sequence is used as a reference sequence,
and a decision-directed mode during which the output of the decision-device replaces the reference
sequence. The input sequence $\{s(i)\}$ is chosen from a quadrature-amplitude modulation (QAM)
constellation (e.g., 4-QAM, 16-QAM, 64-QAM, or 256-QAM).


FIGURE III.2 An adaptive linear equalizer operating in two modes: training mode and decision-directed
mode.


(a) Write a program that trains the adaptive filter with 500 symbols from a QPSK constellation,
followed by decision-directed operation during 5000 symbols from a 16-QAM constellation.
Choose the noise variance $\sigma_v^2$ in order to enforce an SNR level of 30 dB at the input of the
equalizer. Note that symbols chosen from QAM constellations do not have unit variance.
For this reason, the noise variance needs to be adjusted properly for different QAM orders
in order to enforce the desired SNR level (see Prob. II.16). Choose $\Delta = 15$ and equalizer
length $L = 35$. Use $\epsilon$-NLMS to train the equalizer with step-size $\mu = 0.4$ and $\epsilon = 10^{-6}$.
Plot the scatter diagrams of $\{s(i), u(i), \hat{s}(i - \Delta)\}$.


(b) For the same setting as part (a), plot and compare the scatter diagrams that would result at the
output of the equalizer if training is performed only for 150, 300, and 500 iterations. Repeat
the simulations using LMS with $\mu = 0.001$.


(c) Now assume the transmitted data are generated from a 256-QAM constellation rather than a
16-QAM constellation. Plot the scatter diagrams of the output of the equalizer, when trained
with $\epsilon$-NLMS using 500 training symbols.

(d) Generate symbol-error-rate (SER) curves versus signal-to-noise ratio (SNR) at the input of 
the equalizer for (4, 16, 64, 256)-QAM data. Let the SNR vary between 5 dB and 30 dB in 
increments of 1 dB. 


(e) Continue with SNR at 30 dB. Design a decision-feedback equalizer with $L = 10$ feedforward
taps and $Q = 2$ feedback taps. Use $\Delta = 7$ and plot the resulting scatter diagram of the output
of the equalizer. Repeat for $L = 20$, $Q = 2$ and $\Delta = 10$. In both cases, choose the
transmitted data from a 64-QAM constellation.


(f) Generate SER curves versus SNR at the input of the DFE for (4, 16, 64, 256)-QAM data. Let 
the SNR vary between 5 dB and 30 dB. Compare the performance of the DFE with that of 
the linear equalizer of part (d). 


(g) Load the file channel, which contains the impulse response sequence of a more challenging
channel with spectral nulls. Set the SNR level at the input of the equalizer to 40 dB and select
a linear equalizer structure with 55 taps. Set also the delay at $\Delta = 30$. Train the equalizer
using $\epsilon$-NLMS for 2000 iterations before switching to decision-directed operation. Plot the
resulting scatter diagram of the output of the equalizer. Now train it again using RLS for
100 iterations before switching to decision-directed operation, and plot the resulting scatter
diagram. Compare both diagrams.


Project III.4 (Blind adaptive equalization) We consider the same channel used in Computer
Project II.3,

$$C(z) = 0.5 + 1.2z^{-1} + 1.5z^{-2} - z^{-3}$$

and proceed to design blind adaptive equalizers for it. The equalizer structure is shown in Fig. III.3.
Symbols $\{s(i)\}$ are transmitted through the channel and corrupted by additive complex-valued white
noise $\{v(i)\}$. The received signal $\{u(i)\}$ is processed by a linear equalizer, whose outputs $\{z(i)\}$
are fed into a decision device to generate $\{\hat{s}(i - \Delta)\}$. These signals are delayed decisions and
the value of $\Delta$ is determined by the delay that the signals undergo when travelling through the
channel and the equalizer. In this project, the equalizer is supposed to operate blindly, i.e., without a
reference sequence and therefore without a training mode. Most blind algorithms use the output of
the equalizer, $z(i)$, to generate an error signal $e(i)$, which is used to adapt the equalizer coefficients
according to the rule

$$w_i = w_{i-1} + \mu u_i^*e(i)$$

where $u_i$ is the regressor at time $i$. Some blind algorithms use the output of the decision device,
$\hat{s}(i - \Delta)$, to evaluate $e(i)$ (e.g., the “stop-and-go” variant described in part (d) below).


FIGURE III.3 A representation of a general structure for blind adaptive equalization.


(a) Write a program that transmits 10000 QPSK symbols through the channel and trains a 35-tap
equalizer using CMA2-2,

$$w_i = w_{i-1} + \mu u_i^*z(i)\left[\gamma - |z(i)|^2\right], \quad z(i) = u_iw_{i-1}$$

Choose the value of $\gamma$ as $\mathsf{E}\,|s|^4/\mathsf{E}\,|s|^2$, which is in terms of the second and fourth-order
moments of the symbol constellation. For QPSK data, $\gamma = 1$. Set the SNR level at the input
of the equalizer to 30 dB and use $\mu = 0.001$. Plot the impulse responses of the channel, the
equalizer, and the combination channel-equalizer at the end of the adaptation process. How
much delay does the signal undergo in travelling through the channel and the equalizer? Plot
the scatter diagrams of the transmitted sequence, the received sequence, and the sequence at
the output of the equalizer. Ignore the first 2000 transmissions and count the number of erroneous
decisions in the remaining decisions (you should take into account the delay introduced
by the channel-equalizer system).


(b) Repeat the simulations of part (a) using 16-QAM data, for which $\gamma = 13.2$ (verify this value).
Use $\mu = 0.000001$. Run the simulation for 30000 symbols and ignore the first 15000 for error
calculation. These numbers are larger than in the QPSK case of part (a), and the step-size is
also significantly smaller, since the equalizer converges at a slower pace now.


(c) Repeat the simulations of part (b) using the multi-modulus algorithm (MMA) of Prob. III.35. 


(d) Repeat the simulations of part (b) using the following three additional blind adaptive algorithms:

(d.1) CMA1-2 from Prob. III.36, where $\gamma$ is now chosen as $\gamma = \mathsf{E}\,|s|^2/\mathsf{E}\,|s|$ in terms of the
second moment of the symbol constellation divided by the mean of the magnitude of the
symbols. For 16-QAM data we find $\gamma = 3.3385$ (verify this value). Use $\mu = 0.0001$
and increase the SNR level at the input of the equalizer to 60 dB. Simulate for 30000
iterations and plot the scatter diagram of the output of the equalizer after ignoring the
first 15000 samples.

(d.2) The reduced constellation algorithm (RCA) of Prob. III.34, where $\gamma$ is now chosen as
$\gamma = \mathsf{E}\,|s|^2/\mathsf{E}\,|s|_1$, in terms of the second moment of the symbol constellation divided
by the mean of the $\ell_1$-norm of the symbols (remember that the $\ell_1$-norm of a complex
number amounts to adding the absolute values of its real and imaginary parts, as in
Prob. III.14). For 16-QAM data we find $\gamma = 2.5$ (verify this value). Use the same
step-size and same simulation duration as part (d.1).

(d.3) The “stop-and-go” algorithm is a blind adaptation scheme that employs the decision-directed
error,

$$e_d(i) = \hat{s}(i - \Delta) - z(i)$$

It also employs a flag to indicate how reliable $e_d(i)$ is. The flag is set by comparing
$e_d(i)$ to another error signal, say the one used by the RCA recursion,

$$e_s(i) = \gamma\,\mathrm{csgn}(z(i)) - z(i), \quad \gamma = \mathsf{E}\,|s|^2/\mathsf{E}\,|s|_1$$

If the complex signs of $\{e_d(i), e_s(i)\}$ differ, then $e_d(i)$ is assumed unreliable and the
flag is set to zero (see Picchi and Prati (1987)). More explicitly, the stop-and-go recursion
takes the form:

$$z(i) = u_iw_{i-1}$$
$$e_d(i) = \hat{s}(i - \Delta) - z(i) \quad \text{(decision-directed error)}$$
$$e_s(i) = \gamma\,\mathrm{csgn}(z(i)) - z(i) \quad \text{(RCA error)}$$
$$f(i) = \begin{cases}1 & \text{if } \mathrm{csgn}(e_d(i)) = \mathrm{csgn}(e_s(i))\\ 0 & \text{if } \mathrm{csgn}(e_d(i)) \ne \mathrm{csgn}(e_s(i))\end{cases}$$
$$e(i) = f(i)\,e_d(i)$$
$$w_i = w_{i-1} + \mu u_i^*e(i)$$

Use the same step-size and simulation duration as in part (d.1).
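The three values of $\gamma$ that the project asks to verify for 16-QAM can be computed directly from the constellation. The sketch below assumes the usual unnormalized 16-QAM alphabet with real and imaginary parts in $\{\pm 1, \pm 3\}$ and equally likely symbols; the variable names are illustrative.

```python
# 16-QAM with real and imaginary parts in {-3, -1, 1, 3} (assumed, unnormalized, equiprobable)
levels = [-3, -1, 1, 3]
symbols = [complex(a, b) for a in levels for b in levels]

def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)

m2 = mean(abs(s) ** 2 for s in symbols)                  # E|s|^2
m4 = mean(abs(s) ** 4 for s in symbols)                  # E|s|^4
m1 = mean(abs(s) for s in symbols)                       # E|s|
l1 = mean(abs(s.real) + abs(s.imag) for s in symbols)    # E|s|_1

gamma_cma22 = m4 / m2      # 13.2, used by CMA2-2 and MMA in parts (b) and (c)
gamma_cma12 = m2 / m1      # about 3.3385, used by CMA1-2 in part (d.1)
gamma_rca = m2 / l1        # 2.5, used by RCA and stop-and-go in parts (d.2) and (d.3)
```

A differently scaled constellation would give different constants, so the scaling used to compute $\gamma$ should match the one used to generate the transmitted symbols.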
Energy Conservation


In Part III (Stochastic-Gradient Algorithms) and its problems, we developed stochastic
gradient approximations for several steepest-descent methods. The approximations were 
obtained by replacing exact covariance and cross-correlation quantities by instantaneous 
estimates. The resulting algorithms operate on actual data realizations and they lead to 
adaptive filter implementations. However, stochastic approximations introduce gradient 
noise and, consequently, the performance of adaptive filters will degrade in comparison 
with the performance of the original steepest-descent methods. 

The purpose of this chapter, and of the subsequent chapters in this part (Mean-Square 
Performance) and in Part V (Transient Performance), is to describe a unifying framework 
for the evaluation of the performance of adaptive filters. This objective is rather challeng- 
ing, especially since adaptive filters are, by design, time-variant, stochastic, and nonlinear 
systems. Their update recursions not only depend on the reference and regression data in a 
nonlinear and time-variant manner, but the data they employ are also stochastic in nature. 
For this reason, the study of the performance of adaptive algorithms is a formidable task, 
so much so that exact performance analyses are rare and limited to special cases. It is 
customary to introduce simplifying assumptions in order to make the performance analy- 
ses more tractable. Fortunately, most assumptions tend to lead to reasonable agreements 
between theory and practice. 

The framework developed in this chapter, and pursued further in the following chapters, 
relies on energy conservation arguments. While the performance of different adaptive fil- 
ters tends to be studied separately in the literature, the framework adopted in our presenta- 
tion applies uniformly across different classes of filters. In particular, the same framework 
is used for steady-state analysis, tracking analysis, and transient analysis. In other words, 
energy-based arguments stand out as a common theme that runs throughout our treatment 
of the performance of adaptive filters. 


15.1 PERFORMANCE MEASURE 


Before plunging into a detailed study of adaptive filter performance, we need to explain 
some of the issues that arise in this context, including the need to adopt a common perfor- 
mance measure and the need to model adaptive algorithms in terms of stochastic equations. 
We use the LMS filter as a motivation for our explanations. 

Thus, recall that the steepest-descent iteration (8.20), namely,

$$ w_i = w_{i-1} + \mu\,[R_{du} - R_u w_{i-1}] \tag{15.1} $$

was reduced to the LMS recursion (10.10), i.e.,

$$ w_i = w_{i-1} + \mu u_i^* [d(i) - u_i w_{i-1}] \tag{15.2} $$

by replacing the second-order moments $R_{du} = \mathsf{E}\,\boldsymbol{d}\boldsymbol{u}^*$ and $R_u = \mathsf{E}\,\boldsymbol{u}^*\boldsymbol{u}$ by the instantaneous
approximations

$$ R_{du} \approx d(i)\,u_i^* \quad \text{and} \quad R_u \approx u_i^* u_i \tag{15.3} $$
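The LMS recursion (15.2) can be sketched in a few lines. The function name `lms_step`, the step-size, and the noiseless test data below are assumptions made for illustration only; regressors are stored as Python lists standing in for the row vector $u_i$.

```python
import random


def lms_step(w, u, d, mu):
    """One LMS iteration (15.2): w_i = w_{i-1} + mu * u_i^* [d(i) - u_i w_{i-1}]."""
    e = d - sum(uk * wk for uk, wk in zip(u, w))     # d(i) - u_i w_{i-1}
    return [wk + mu * uk.conjugate() * e for wk, uk in zip(w, u)], e


# Single step with easy numbers: e = 3, so w becomes [0.3, 0.6].
w_one, e_one = lms_step([0.0, 0.0], [1.0, 2.0], 3.0, mu=0.1)

# Noiseless data d(i) = u_i w_true: the iterates approach w_true.
random.seed(0)
w_true = [1.0, -2.0]                                 # hypothetical unknown vector
w = [0.0, 0.0]
for _ in range(500):
    u = [random.gauss(0, 1), random.gauss(0, 1)]
    d = sum(uk * wk for uk, wk in zip(u, w_true))
    w, _ = lms_step(w, u, d, mu=0.1)
```

With noisy data the iterates would instead keep fluctuating around the solution, which is the gradient-noise effect studied in this chapter.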


Other adaptive algorithms were obtained in Chapter 10 by using similar instantaneous
approximations. Recall further that we examined the convergence properties of (15.1) in
Chapter 8 in some detail. Specifically, we established in Thm. 8.2 that by choosing the
step-size $\mu$ such that

$$ 0 < \mu < 2/\lambda_{\max} \tag{15.4} $$

where $\lambda_{\max}$ is the largest eigenvalue of $R_u$, the successive weight estimates $w_i$ of (15.1)
are guaranteed to converge to the solution $w^o$ of the normal equations, i.e., to the vector

$$ w^o = R_u^{-1} R_{du} \tag{15.5} $$

that solves the least-mean-squares problem

$$ \min_{w}\; \mathsf{E}\,|\boldsymbol{d} - \boldsymbol{u}w|^2 \tag{15.6} $$

Correspondingly, the learning curve of the steepest-descent method (15.1), namely,

$$
\begin{aligned}
J(i) &= \mathsf{E}\,|\boldsymbol{d} - \boldsymbol{u}w_{i-1}|^2 \\
     &= \sigma_d^2 - R_{ud} w_{i-1} - w_{i-1}^* R_{du} + w_{i-1}^* R_u w_{i-1}
\end{aligned}
$$

is also guaranteed to converge to the minimum cost of (15.6), i.e.,

$$ J(i) \;\longrightarrow\; J_{\min} = \mathsf{E}\,|\boldsymbol{d} - \boldsymbol{u}w^o|^2 = \sigma_d^2 - R_{ud} R_u^{-1} R_{du} \tag{15.7} $$

where $\sigma_d^2 = \mathsf{E}\,|\boldsymbol{d}|^2$ and $R_{ud} = R_{du}^*$.
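To make (15.4)-(15.7) concrete, the sketch below runs the steepest-descent iteration (15.1) on a small real-valued example. The moments `R_u`, `R_du`, and `sigma_d2` are made-up numbers chosen so that $w^o = [1, -1]$ and $J_{\min} = 1$; they do not come from the text.

```python
R_u = [[2.0, 1.0], [1.0, 2.0]]      # assumed covariance matrix (eigenvalues 1 and 3)
R_du = [1.0, -1.0]                  # assumed cross-covariance vector
sigma_d2 = 3.0                      # assumed variance of d
mu = 0.5                            # satisfies 0 < mu < 2/lambda_max = 2/3


def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def cost(w):
    """J(w) = sigma_d^2 - 2 R_du^T w + w^T R_u w for real-valued data."""
    Rw = matvec(R_u, w)
    return (sigma_d2
            - 2.0 * sum(r * wk for r, wk in zip(R_du, w))
            + sum(wk * r for wk, r in zip(w, Rw)))


w = [0.0, 0.0]
for _ in range(200):
    Rw = matvec(R_u, w)
    w = [wk + mu * (r - rw) for wk, r, rw in zip(w, R_du, Rw)]   # iteration (15.1)

J_final = cost(w)                   # approaches J_min = sigma_d^2 - R_ud R_u^{-1} R_du
```

The two modes of the error decay as $(1-\mu\lambda)^i$ with $\lambda \in \{1, 3\}$, which is why any step-size inside (15.4) works here.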

Obviously, the behavior of the weight estimates $w_i$ that are generated by the stochastic-gradient
approximation (15.2) is more complex than the behavior of the weight estimates
$w_i$ that are generated by the steepest-descent method (15.1). This is because the $w_i$ from
(15.2) need not converge to $w^o$ anymore due to gradient noise. It is the purpose of Parts
IV (Mean-Square Performance) and V (Transient Performance) to examine the effect of
gradient noise on filter performance, not only for LMS but also for several other adaptive
filters.


Stochastic Equations 
First, however, in all such performance studies, it is necessary to regard (or treat) the update 
equation of an adaptive filter as a stochastic difference equation rather than a deterministic 
difference equation. What this means is that we need to regard the variables that appear 
in an update equation of the form (15.2) as random variables. Recall our convention in 
this book that random variables are represented in boldface, while realizations of random 
variables are represented in normal font. 

For this reason, we shall write from now on $\{\boldsymbol{d}(i), \boldsymbol{u}_i\}$, with boldface letters, instead
of $\{d(i), u_i\}$. The notation $\boldsymbol{d}(i)$ refers to a zero-mean random variable with variance $\sigma_d^2$,
while $\boldsymbol{u}_i$ denotes a zero-mean row vector with covariance matrix $R_u$,

$$ \mathsf{E}\,|\boldsymbol{d}(i)|^2 = \sigma_d^2, \qquad \mathsf{E}\,\boldsymbol{u}_i^*\boldsymbol{u}_i = R_u, \qquad \mathsf{E}\,\boldsymbol{d}(i)\boldsymbol{u}_i^* = R_{du} $$

In the same vein, we shall replace the weight estimates $w_i$ and $w_{i-1}$ in the update equation
of an adaptive algorithm by $\boldsymbol{w}_i$ and $\boldsymbol{w}_{i-1}$, respectively, since, by being functions of
$\{\boldsymbol{d}(i), \boldsymbol{u}_i\}$, they become random variables as well.

In this way, the stochastic equation that corresponds to the LMS filter (15.2) would be

$$ \boldsymbol{w}_i = \boldsymbol{w}_{i-1} + \mu \boldsymbol{u}_i^* [\boldsymbol{d}(i) - \boldsymbol{u}_i \boldsymbol{w}_{i-1}] \qquad \text{(a stochastic equation)} $$

with the initial condition $\boldsymbol{w}_{-1}$ also treated as a random vector. When this equation is
implemented as an adaptive algorithm, it would operate on observations $\{d(i), u_i\}$ of the
random quantities $\{\boldsymbol{d}(i), \boldsymbol{u}_i\}$, in which case the stochastic equation would be replaced by
our earlier deterministic description (15.2) for LMS, namely,

$$ w_i = w_{i-1} + \mu u_i^* [d(i) - u_i w_{i-1}] \qquad \text{(a deterministic equation)} $$

Similar considerations are valid for the update equations of all other adaptive algorithms.
In all of them, we replace the deterministic quantities $\{d(i), u_i, w_i, w_{i-1}\}$ by random
quantities $\{\boldsymbol{d}(i), \boldsymbol{u}_i, \boldsymbol{w}_i, \boldsymbol{w}_{i-1}\}$ in order to obtain the corresponding stochastic equations.


Excess Mean-Square Error and Misadjustment 
Now in order to compare the performance of adaptive filters, it is customary to adopt a 
common performance measure across filters (even though different filters may have been 
derived by minimizing different cost functions). The criterion that is most widely used in 
the literature of adaptive filtering is the steady-state mean-square error (MSE) criterion, 
which is defined as 

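$$ \mathrm{MSE} \;\triangleq\; \lim_{i\to\infty} \mathsf{E}\,|\boldsymbol{e}(i)|^2 \tag{15.8} $$

where $\boldsymbol{e}(i)$ denotes the a priori output estimation error,

$$ \boldsymbol{e}(i) \;\triangleq\; \boldsymbol{d}(i) - \boldsymbol{u}_i \boldsymbol{w}_{i-1} \tag{15.9} $$

Obviously, if the weight estimator $\boldsymbol{w}_{i-1}$ in (15.9) is replaced by the optimal solution $w^o$ of
(15.6), then the value of the MSE would coincide with the minimum cost (15.7), namely,

$$ J_{\min} = \sigma_d^2 - R_{ud} R_u^{-1} R_{du} $$

For this reason, it is common to define the excess-mean-square error (EMSE) of an adaptive
filter as the difference

$$ \mathrm{EMSE} \;\triangleq\; \mathrm{MSE} - J_{\min} \tag{15.10} $$

It is also common to define a relative measure of performance, called misadjustment, as

$$ \mathcal{M} \;\triangleq\; \mathrm{EMSE}/J_{\min} \tag{15.11} $$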


15.2 STATIONARY DATA MODEL 


Therefore, given a stochastic difference equation describing an adaptive filter, we are
interested in evaluating its EMSE. In order to pursue this objective, and in order to facilitate
the ensuing performance analysis, we also need to adopt a model for the data $\{\boldsymbol{d}(i), \boldsymbol{u}_i\}$.

To begin with, recall from the orthogonality principle of linear least-mean-squares estimation
(cf. Thm. 4.1) that the solution $w^o$ of (15.6) satisfies the uncorrelatedness property

$$ \mathsf{E}\,\boldsymbol{u}_i^*(\boldsymbol{d}(i) - \boldsymbol{u}_i w^o) = 0 $$

Let $\boldsymbol{v}(i)$ denote the estimation error (residual), i.e.,

$$ \boldsymbol{v}(i) = \boldsymbol{d}(i) - \boldsymbol{u}_i w^o $$

Then we can re-express this result by saying that $\{\boldsymbol{d}(i), \boldsymbol{u}_i\}$ are related via

$$ \boldsymbol{d}(i) = \boldsymbol{u}_i w^o + \boldsymbol{v}(i) \tag{15.12} $$

in terms of a signal $\boldsymbol{v}(i)$ that is uncorrelated with $\boldsymbol{u}_i$. The variance of $\boldsymbol{v}(i)$ is obviously
equal to the minimum cost $J_{\min}$ from (15.7), i.e.,

$$ \sigma_v^2 \;\triangleq\; \mathsf{E}\,|\boldsymbol{v}(i)|^2 = J_{\min} = \sigma_d^2 - R_{ud} R_u^{-1} R_{du} \tag{15.13} $$


Linear Regression Model 

What the above argument shows is that given any random variables $\{\boldsymbol{d}(i), \boldsymbol{u}_i\}$ with second-order
moments $\{\sigma_d^2, R_u, R_{du}\}$, we can always assume that $\{\boldsymbol{d}(i), \boldsymbol{u}_i\}$ are related via a
linear model of the form (15.12), for some $w^o$, with the variable $\boldsymbol{v}(i)$ playing the role of a
disturbance that is uncorrelated with $\boldsymbol{u}_i$, i.e.,

$$ \boldsymbol{d}(i) = \boldsymbol{u}_i w^o + \boldsymbol{v}(i), \qquad \mathsf{E}\,|\boldsymbol{v}(i)|^2 = \sigma_v^2, \qquad \mathsf{E}\,\boldsymbol{v}(i)\boldsymbol{u}_i^* = 0 \tag{15.14} $$

However, in order to make the performance analyses of adaptive filters more tractable, we
usually need to adopt the stronger assumption that

$$ \text{The sequence } \{\boldsymbol{v}(i)\} \text{ is i.i.d. and independent of all } \{\boldsymbol{u}_j\} \tag{15.15} $$

Here, the notation i.i.d. stands for "independent and identically distributed". Condition
(15.15) on $\boldsymbol{v}(i)$ is an assumption because, as explained above, the signal $\boldsymbol{v}(i)$ in (15.14) is
only uncorrelated with $\boldsymbol{u}_i$; it is not necessarily independent of $\boldsymbol{u}_i$, or of all $\{\boldsymbol{u}_j\}$ for that
matter. Still, there are situations when conditions (15.14) and (15.15) hold simultaneously,
e.g., in the channel estimation application of Sec. 10.5. In that application, it is usually
justified to expect the noise sequence $\{\boldsymbol{v}(i)\}$ to be i.i.d. and independent of all other data,
including the regression data.
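The gap between "uncorrelated" in (15.14) and "independent" in (15.15) can be seen on synthetic data. In the sketch below, which is an illustrative assumption and not an example from the text, $d$ depends nonlinearly on the first regressor entry. The residual of the linear least-squares fit is then orthogonal to the regressor, yet it is clearly not independent of it. Sample averages stand in for expectations.

```python
import random

random.seed(0)
N = 5000
U = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(N)]
# Hypothetical data: a linear part plus a zero-mean nonlinear function of u1.
D = [u1 - u2 + 0.5 * (u1 * u1 - 1.0) for (u1, u2) in U]

# Sample moments R_u (2 x 2) and R_du (2 x 1) for real-valued data.
r11 = sum(u1 * u1 for u1, _ in U) / N
r12 = sum(u1 * u2 for u1, u2 in U) / N
r22 = sum(u2 * u2 for _, u2 in U) / N
p1 = sum(d * u1 for d, (u1, _) in zip(D, U)) / N
p2 = sum(d * u2 for d, (_, u2) in zip(D, U)) / N

# Solve the normal equations R_u w = R_du by Cramer's rule.
det = r11 * r22 - r12 * r12
w_o = ((p1 * r22 - r12 * p2) / det, (r11 * p2 - r12 * p1) / det)

# Residual v(i) = d(i) - u_i w^o and its sample correlations.
V = [d - (u1 * w_o[0] + u2 * w_o[1]) for d, (u1, u2) in zip(D, U)]
corr_vu = (sum(v * u1 for v, (u1, _) in zip(V, U)) / N,
           sum(v * u2 for v, (_, u2) in zip(V, U)) / N)
dep_v_u1sq = sum(v * u1 * u1 for v, (u1, _) in zip(V, U)) / N
```

Here `corr_vu` vanishes up to rounding, as the orthogonality principle requires, while `dep_v_u1sq` stays well away from zero, showing that the residual still depends on the regressor.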

Given the above, we shall therefore adopt the following data model in our studies of
the performance of adaptive filters. We shall assume that the data $\{\boldsymbol{d}(i), \boldsymbol{u}_i\}$ satisfy the
following conditions:

(a) There exists a vector $w^o$ such that $\boldsymbol{d}(i) = \boldsymbol{u}_i w^o + \boldsymbol{v}(i)$.
(b) The noise sequence $\{\boldsymbol{v}(i)\}$ is i.i.d. with variance $\sigma_v^2 = \mathsf{E}\,|\boldsymbol{v}(i)|^2$.
(c) The noise sequence $\{\boldsymbol{v}(i)\}$ is independent of $\boldsymbol{u}_j$ for all $i, j$.
(d) The initial condition $\boldsymbol{w}_{-1}$ is independent of all $\{\boldsymbol{d}(j), \boldsymbol{u}_j, \boldsymbol{v}(j)\}$.
(e) The regressor covariance matrix is $R_u = \mathsf{E}\,\boldsymbol{u}_i^*\boldsymbol{u}_i > 0$.
(f) The random variables $\{\boldsymbol{d}(i), \boldsymbol{v}(i), \boldsymbol{u}_i\}$ have zero means.
                                                                              (15.16)

We refer to the above model as describing a stationary environment, i.e., an environment
with constant quantities $\{w^o, R_u, \sigma_v^2\}$. In Chapter 20 we shall modify the model in
order to accommodate filter operation in nonstationary environments where, for example,
$w^o$ will be time-variant; see Sec. 20.2.


Useful Independence Results 

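An important consequence of the data model (15.16) is that, at any particular time instant $i$,
the noise variable $\boldsymbol{v}(i)$ will be independent of all previous weight estimators $\{\boldsymbol{w}_j,\ j < i\}$.
This fact follows easily from examining the update equation of an adaptive filter. Consider,
for instance, the LMS recursion

$$ \boldsymbol{w}_i = \boldsymbol{w}_{i-1} + \mu \boldsymbol{u}_i^* [\boldsymbol{d}(i) - \boldsymbol{u}_i \boldsymbol{w}_{i-1}], \qquad \boldsymbol{w}_{-1} = \text{initial condition} $$

By iterating the recursion we find that, at any time instant $j$, the weight estimator $\boldsymbol{w}_j$ can
be expressed as a function of $\boldsymbol{w}_{-1}$, the reference signals $\{\boldsymbol{d}(j), \boldsymbol{d}(j-1), \ldots, \boldsymbol{d}(0)\}$, and
the regressors $\{\boldsymbol{u}_j, \boldsymbol{u}_{j-1}, \ldots, \boldsymbol{u}_0\}$. We denote this dependency generically as

$$ \boldsymbol{w}_j = F\big[\boldsymbol{w}_{-1};\ \boldsymbol{d}(j), \boldsymbol{d}(j-1), \ldots, \boldsymbol{d}(0);\ \boldsymbol{u}_j, \boldsymbol{u}_{j-1}, \ldots, \boldsymbol{u}_0\big] \tag{15.17} $$

for some function $F$. A similar dependency holds for other adaptive schemes.

Now $\boldsymbol{v}(i)$ can be seen to be independent of each one of the terms appearing as an
argument of $F$ in (15.17), so that $\boldsymbol{v}(i)$ will be independent of $\boldsymbol{w}_j$ for all $j < i$. Indeed,
the independence of $\boldsymbol{v}(i)$ from $\{\boldsymbol{w}_{-1}, \boldsymbol{u}_j, \ldots, \boldsymbol{u}_0\}$ is obvious by assumption, while its
independence from $\{\boldsymbol{d}(j), \ldots, \boldsymbol{d}(0)\}$ can be seen as follows. Consider $\boldsymbol{d}(j)$ for example.
Then from $\boldsymbol{d}(j) = \boldsymbol{u}_j w^o + \boldsymbol{v}(j)$ we see that $\boldsymbol{d}(j)$ is a function of $\{\boldsymbol{u}_j, \boldsymbol{v}(j)\}$, both of which
are independent of $\boldsymbol{v}(i)$.

Given that $\boldsymbol{v}(i)$ is independent of $\{\boldsymbol{w}_j,\ j < i\}$, it also follows that $\boldsymbol{v}(i)$ is independent
of $\{\tilde{\boldsymbol{w}}_j,\ j < i\}$, where $\tilde{\boldsymbol{w}}_j$ denotes the weight-error vector:

$$ \tilde{\boldsymbol{w}}_j \;\triangleq\; w^o - \boldsymbol{w}_j $$

Moreover, $\boldsymbol{v}(i)$ is also independent of the a priori estimation error $\boldsymbol{e}_a(i)$, defined by

$$ \boldsymbol{e}_a(i) \;\triangleq\; \boldsymbol{u}_i \tilde{\boldsymbol{w}}_{i-1} $$

This variable measures the difference between $\boldsymbol{u}_i w^o$ and $\boldsymbol{u}_i \boldsymbol{w}_{i-1}$, i.e., it measures how
close the estimator $\boldsymbol{u}_i \boldsymbol{w}_{i-1}$ is to the optimal linear estimator of $\boldsymbol{d}(i)$, namely $\hat{\boldsymbol{d}}(i) = \boldsymbol{u}_i w^o$.
We summarize the above independence properties in the following statement.


Lemma 15.1 (Useful properties) From the data model (15.16), it follows that
$\boldsymbol{v}(i)$ is independent of each of the following:

$$ \{\boldsymbol{w}_j \text{ for } j < i\}, \qquad \{\tilde{\boldsymbol{w}}_j \text{ for } j < i\}, \qquad \text{and} \qquad \boldsymbol{e}_a(i) = \boldsymbol{u}_i \tilde{\boldsymbol{w}}_{i-1} $$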

Alternative Expression for the EMSE 

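Using model (15.16), and the independence results of Lemma 15.1, we can determine
a more compact expression for the EMSE of an adaptive filter. Recall from (15.8) and
(15.10) that, by definition,

$$ \mathrm{EMSE} = \lim_{i\to\infty} \mathsf{E}\,|\boldsymbol{e}(i)|^2 - J_{\min} \tag{15.18} $$

where, as we already know from (15.13), $J_{\min} = \sigma_v^2$. Using

$$ \boldsymbol{e}(i) = \boldsymbol{d}(i) - \boldsymbol{u}_i \boldsymbol{w}_{i-1} $$

and the linear model (15.16), we find that

$$ \boldsymbol{e}(i) = \boldsymbol{v}(i) + \boldsymbol{u}_i (w^o - \boldsymbol{w}_{i-1}) $$

That is,

$$ \boldsymbol{e}(i) = \boldsymbol{v}(i) + \boldsymbol{e}_a(i) \tag{15.19} $$

Now the independence of $\boldsymbol{v}(i)$ and $\boldsymbol{e}_a(i)$, as stated in Lemma 15.1, together with the zero
mean of $\boldsymbol{v}(i)$, makes the cross-terms vanish and gives

$$ \mathsf{E}\,|\boldsymbol{e}(i)|^2 = \mathsf{E}\,|\boldsymbol{v}(i)|^2 + \mathsf{E}\,|\boldsymbol{e}_a(i)|^2 = \sigma_v^2 + \mathsf{E}\,|\boldsymbol{e}_a(i)|^2 \tag{15.20} $$

so that substituting into (15.18) we get

$$ \mathrm{EMSE} = \lim_{i\to\infty} \mathsf{E}\,|\boldsymbol{e}_a(i)|^2 \tag{15.21} $$

In other words, the EMSE can be computed by evaluating the steady-state mean-square
value of the a priori estimation error $\boldsymbol{e}_a(i)$. We shall use this alternative representation to
evaluate the EMSE of several adaptive algorithms in this chapter. Likewise, from (15.8)
and (15.20), we have

$$ \mathrm{MSE} = \mathrm{EMSE} + \sigma_v^2 \tag{15.22} $$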
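A small simulation can illustrate (15.21) and (15.22). The sketch below runs real-valued LMS on data generated according to model (15.16) and replaces the steady-state expectations by time averages taken after a burn-in period. The filter length, step-size, noise level, and the vector `w_o` are arbitrary assumptions for this example.

```python
import random

random.seed(1)
M, mu, sigma_v = 4, 0.05, 0.1           # assumed filter length, step-size, noise std
w_o = [0.5, -1.0, 0.25, 2.0]            # assumed vector w^o in d(i) = u_i w^o + v(i)
w = [0.0] * M
N, burn_in = 30000, 5000                # discard the transient before averaging

sum_e2 = sum_ea2 = 0.0
for i in range(N):
    u = [random.gauss(0, 1) for _ in range(M)]               # white regressor, R_u = I
    v = random.gauss(0, sigma_v)                             # i.i.d. noise, independent of u
    d = sum(uk * wk for uk, wk in zip(u, w_o)) + v
    e = d - sum(uk * wk for uk, wk in zip(u, w))             # a priori output error e(i)
    e_a = sum(uk * (wo - wk) for uk, wo, wk in zip(u, w_o, w))   # e_a(i) = u_i (w^o - w_{i-1})
    if i >= burn_in:
        sum_e2 += e * e
        sum_ea2 += e_a * e_a
    w = [wk + mu * uk * e for wk, uk in zip(w, u)]           # LMS update

K = N - burn_in
mse = sum_e2 / K                         # estimate of (15.8)
emse = sum_ea2 / K                       # estimate of (15.21)
misadjustment = emse / sigma_v ** 2      # (15.11) with J_min = sigma_v^2
```

Averaging $|e_a(i)|^2$ directly is only possible in simulation, where $w^o$ is known; it avoids subtracting two nearly equal numbers as in (15.18).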
Error Quantities

For ease of reference, we collect in Table 15.1 the definitions of several error measures.
Note that we refer to both $\{\boldsymbol{e}(i), \boldsymbol{e}_a(i)\}$ as a priori errors since they rely on $\boldsymbol{w}_{i-1}$, while
we refer to both $\{\boldsymbol{r}(i), \boldsymbol{e}_p(i)\}$ as a posteriori errors since they rely on the updated weight
estimator $\boldsymbol{w}_i$. Note also that we distinguish between $\boldsymbol{e}(i)$ and $\boldsymbol{e}_a(i)$ by referring to the
first one as output error. In the sequel, the estimation errors $\{\boldsymbol{e}_a(i), \boldsymbol{e}_p(i)\}$ will play a
prominent role.


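TABLE 15.1 Definitions of several estimation errors.

Error                          Definition                                        Interpretation
-----------------------------  ------------------------------------------------  ------------------------------------
$\boldsymbol{e}(i)$            $\boldsymbol{d}(i) - \boldsymbol{u}_i \boldsymbol{w}_{i-1}$    a priori output estimation error
$\boldsymbol{r}(i)$            $\boldsymbol{d}(i) - \boldsymbol{u}_i \boldsymbol{w}_i$        a posteriori output estimation error
$\tilde{\boldsymbol{w}}_i$     $w^o - \boldsymbol{w}_i$                          weight error vector
$\boldsymbol{e}_a(i)$          $\boldsymbol{u}_i \tilde{\boldsymbol{w}}_{i-1}$   a priori estimation error
$\boldsymbol{e}_p(i)$          $\boldsymbol{u}_i \tilde{\boldsymbol{w}}_i$       a posteriori estimation error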
15.3 ENERGY CONSERVATION RELATION


Our approach to the performance analysis of adaptive filters in Parts IV (Mean-Square
Performance) and V (Transient Performance) is based on an energy conservation relation
that holds for general data $\{\boldsymbol{d}(i), \boldsymbol{u}_i\}$ (it does not even require the assumptions (15.16)).
In order to motivate this relation, we consider adaptive filter updates of the generic form:

$$ \boldsymbol{w}_i = \boldsymbol{w}_{i-1} + \mu \boldsymbol{u}_i^*\, g[\boldsymbol{e}(i)], \qquad \boldsymbol{w}_{-1} = \text{initial condition} \tag{15.23} $$

where $g[\cdot]$ denotes some function of the a priori output error signal,

$$ \boldsymbol{e}(i) = \boldsymbol{d}(i) - \boldsymbol{u}_i \boldsymbol{w}_{i-1} $$

Updates of this form are said to correspond to filters with error nonlinearities. Several of
the adaptive algorithms that we introduced in Chapter 10 are special cases of (15.23) with
proper selection of the error function $g[\cdot]$, as shown in Table 15.2. Observe that the listing
in the table excludes leaky-LMS, RLS, APA, and CMA algorithms. We shall study these
algorithms later by following a similar procedure; see, e.g., Chapter 19 and Probs. IV.7,
IV.16-IV.18 and V.29-V.30. We can also study update equations with data (as opposed to
error) nonlinearities, say, of the form

$$ \boldsymbol{w}_i = \boldsymbol{w}_{i-1} + \mu\, g[\boldsymbol{u}_i]\, \boldsymbol{u}_i^* \boldsymbol{e}(i), \qquad \boldsymbol{w}_{-1} = \text{initial condition} $$

for some positive function $g[\cdot]$ of the regression data, $g[\boldsymbol{u}_i] > 0$ (the function $g[\cdot]$ could
also be matrix-valued). For example, $\epsilon$-NLMS is a special case of this class of filters by
choosing $g[\boldsymbol{u}_i] = 1/(\epsilon + \|\boldsymbol{u}_i\|^2)$. Likewise, LMS is a special case by choosing $g[\boldsymbol{u}_i] = 1$.
Most of the results in this chapter, especially those pertaining to the energy-conservation
relation of this section, apply to filters with data nonlinearities; see, e.g., Probs. IV.9
and V.22. In the sequel we focus on updates with error nonlinearities.


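TABLE 15.2 Examples of error functions for several adaptive algorithms: $\epsilon$ is a small positive
number, $0 < \delta < 1$, and $p(i)$ is a positive quantity whose computation will be explained later.

Algorithm                                   Error function
------------------------------------------  ------------------------------------------------------------------
$\epsilon$-NLMS                             $g[\boldsymbol{e}(i)] = \boldsymbol{e}(i)/(\epsilon + \|\boldsymbol{u}_i\|^2)$
$\epsilon$-NLMS with power normalization    $g[\boldsymbol{e}(i)] = \boldsymbol{e}(i)/(\epsilon + p(i))$
LMF                                         $g[\boldsymbol{e}(i)] = \boldsymbol{e}(i)\,|\boldsymbol{e}(i)|^2$
LMMN                                        $g[\boldsymbol{e}(i)] = \boldsymbol{e}(i)\,[\delta + (1-\delta)|\boldsymbol{e}(i)|^2]$
sign-error LMS                              $g[\boldsymbol{e}(i)] = \mathrm{csgn}[\boldsymbol{e}(i)]$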
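The error functions of Table 15.2 plug into the generic update (15.23) as interchangeable callables. The sketch below is one possible arrangement; the function names, the default values of `eps` and `delta`, and the way the regressor or the power estimate `p` is passed to the NLMS variants are assumptions made for this example.

```python
def csgn(x):
    """Complex sign: sign(Re x) + j sign(Im x)."""
    def sgn(a):
        return 1.0 if a > 0 else (-1.0 if a < 0 else 0.0)
    return complex(sgn(x.real), sgn(x.imag))


def g_nlms(e, u, eps=1e-6):
    return e / (eps + sum(abs(uk) ** 2 for uk in u))      # e / (eps + ||u_i||^2)


def g_nlms_power(e, p, eps=1e-6):
    return e / (eps + p)                                  # e / (eps + p(i)), p(i) supplied by caller


def g_lmf(e):
    return e * abs(e) ** 2                                # e |e|^2


def g_lmmn(e, delta=0.5):
    return e * (delta + (1.0 - delta) * abs(e) ** 2)      # e [delta + (1 - delta) |e|^2]


def g_sign(e):
    return csgn(e)                                        # csgn[e]


def generic_update(w, u, d, mu, g):
    """Generic update (15.23): w_i = w_{i-1} + mu * u_i^* g[e(i)]."""
    e = d - sum(uk * wk for uk, wk in zip(u, w))
    ge = g(e)
    return [wk + mu * uk.conjugate() * ge for wk, uk in zip(w, u)]


# NLMS with mu = 1 and eps = 0 fits the current sample exactly: u w_new = d.
u, d = [3.0, 4.0], 5.0
w_new = generic_update([0.0, 0.0], u, d, 1.0, lambda e: g_nlms(e, u, eps=0.0))
```

Because `generic_update` only sees a callable, the analysis-friendly structure of (15.23), one recursion and many error functions, carries over directly to code.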


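The update recursion (15.23) can be rewritten in terms of the weight-error vector

$$ \tilde{\boldsymbol{w}}_i \;\triangleq\; w^o - \boldsymbol{w}_i $$

Subtracting both sides of (15.23) from $w^o$ we get

$$ w^o - \boldsymbol{w}_i = w^o - \boldsymbol{w}_{i-1} - \mu \boldsymbol{u}_i^*\, g[\boldsymbol{e}(i)], \qquad \tilde{\boldsymbol{w}}_{-1} = \text{initial condition} $$

that is,

$$ \tilde{\boldsymbol{w}}_i = \tilde{\boldsymbol{w}}_{i-1} - \mu \boldsymbol{u}_i^*\, g[\boldsymbol{e}(i)] \tag{15.24} $$

In addition, if we multiply both sides of (15.24) by $\boldsymbol{u}_i$ from the left we find that the a priori
and a posteriori estimation errors $\{\boldsymbol{e}_a(i), \boldsymbol{e}_p(i)\}$ are related via:

$$ \boldsymbol{u}_i \tilde{\boldsymbol{w}}_i = \boldsymbol{u}_i \tilde{\boldsymbol{w}}_{i-1} - \mu \|\boldsymbol{u}_i\|^2 g[\boldsymbol{e}(i)] $$

or, equivalently,

$$ \boldsymbol{e}_p(i) = \boldsymbol{e}_a(i) - \mu \|\boldsymbol{u}_i\|^2 g[\boldsymbol{e}(i)] \tag{15.25} $$

where $\{\boldsymbol{e}_a(i), \boldsymbol{e}_p(i)\}$ were defined in Table 15.1 as

$$ \boldsymbol{e}_a(i) = \boldsymbol{u}_i \tilde{\boldsymbol{w}}_{i-1}, \qquad \boldsymbol{e}_p(i) = \boldsymbol{u}_i \tilde{\boldsymbol{w}}_i \tag{15.26} $$

Expressions (15.24)-(15.25) provide an alternative description of the adaptive filter
(15.23) in terms of the error quantities $\{\boldsymbol{e}_a(i), \boldsymbol{e}_p(i), \tilde{\boldsymbol{w}}_i, \tilde{\boldsymbol{w}}_{i-1}, \boldsymbol{e}(i)\}$. This description
is useful since we are often interested in questions related to the behavior of these errors,
such as:

1. Steady-state behavior, which relates to determining the steady-state values of $\mathsf{E}\,\|\tilde{\boldsymbol{w}}_i\|^2$,
   $\mathsf{E}\,|\boldsymbol{e}_a(i)|^2$, and $\mathsf{E}\,|\boldsymbol{e}(i)|^2$.

2. Stability, which relates to determining the range of values of the step-size $\mu$ over
   which the variances $\mathsf{E}\,|\boldsymbol{e}_a(i)|^2$ and $\mathsf{E}\,\|\tilde{\boldsymbol{w}}_i\|^2$ remain bounded.

3. Transient behavior, which relates to studying the time evolution of the curves $\mathsf{E}\,|\boldsymbol{e}_a(i)|^2$
   and $\{\mathsf{E}\,\tilde{\boldsymbol{w}}_i,\ \mathsf{E}\,\|\tilde{\boldsymbol{w}}_i\|^2\}$.

In order to address questions of this kind, we shall rely on an energy equality that relates
the squared norms of the errors $\{\boldsymbol{e}_a(i), \boldsymbol{e}_p(i), \tilde{\boldsymbol{w}}_{i-1}, \tilde{\boldsymbol{w}}_i\}$.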


Algebraic Derivation 

To derive the energy relation, we first combine (15.24)-(15.25) to eliminate the error
nonlinearity $g[\cdot]$ from (15.24), i.e., we solve for $g[\cdot]$ from (15.25) and then substitute into
(15.24), as done below. What this initial step means is that the resulting energy relation
will hold irrespective of the error nonlinearity. We distinguish between two cases:

1. $\boldsymbol{u}_i = 0$. This is a degenerate situation. In this case, it is obvious from (15.24) and
   (15.25) that $\tilde{\boldsymbol{w}}_i = \tilde{\boldsymbol{w}}_{i-1}$ and $\boldsymbol{e}_a(i) = \boldsymbol{e}_p(i)$ so that

   $$ \|\tilde{\boldsymbol{w}}_i\|^2 = \|\tilde{\boldsymbol{w}}_{i-1}\|^2 \quad \text{and} \quad |\boldsymbol{e}_a(i)|^2 = |\boldsymbol{e}_p(i)|^2 \tag{15.27} $$

2. $\boldsymbol{u}_i \neq 0$. In this case, we use (15.25) to solve for $g[\boldsymbol{e}(i)]$,

   $$ g[\boldsymbol{e}(i)] = \frac{1}{\mu \|\boldsymbol{u}_i\|^2}\,[\boldsymbol{e}_a(i) - \boldsymbol{e}_p(i)] $$

   and substitute into (15.24) to obtain

   $$ \tilde{\boldsymbol{w}}_i = \tilde{\boldsymbol{w}}_{i-1} - \frac{\boldsymbol{u}_i^*}{\|\boldsymbol{u}_i\|^2}\,[\boldsymbol{e}_a(i) - \boldsymbol{e}_p(i)] \tag{15.28} $$

   This relation involves the four errors $\{\tilde{\boldsymbol{w}}_i, \tilde{\boldsymbol{w}}_{i-1}, \boldsymbol{e}_a(i), \boldsymbol{e}_p(i)\}$; observe that even the
   step-size $\mu$ is cancelled out. Expression (15.28) can be rearranged as

   $$ \tilde{\boldsymbol{w}}_i + \frac{\boldsymbol{u}_i^*}{\|\boldsymbol{u}_i\|^2}\,\boldsymbol{e}_a(i) = \tilde{\boldsymbol{w}}_{i-1} + \frac{\boldsymbol{u}_i^*}{\|\boldsymbol{u}_i\|^2}\,\boldsymbol{e}_p(i) \tag{15.29} $$

   On each side of this identity we have a combination of a priori and a posteriori
   errors. By evaluating the energies (i.e., the squared Euclidean norms) of both sides
   we find, after a straightforward calculation, that the following energy equality holds:

   $$ \|\tilde{\boldsymbol{w}}_i\|^2 + \frac{1}{\|\boldsymbol{u}_i\|^2}\,|\boldsymbol{e}_a(i)|^2 = \|\tilde{\boldsymbol{w}}_{i-1}\|^2 + \frac{1}{\|\boldsymbol{u}_i\|^2}\,|\boldsymbol{e}_p(i)|^2 \tag{15.30} $$

   Interestingly enough, this equality simply amounts to adding the energies of the individual
   terms of (15.29); the cross-terms cancel out. This is one advantage of working
   with the energy relation (15.30): irrelevant cross-terms are eliminated so that one
   does not need to worry later about evaluating their expectations.
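For readers who want to see the "straightforward calculation" spelled out, the following worked expansion is one way to verify that the cross-terms match. It uses only $\boldsymbol{u}_i \boldsymbol{u}_i^* = \|\boldsymbol{u}_i\|^2$ and the definitions $\boldsymbol{e}_a(i) = \boldsymbol{u}_i \tilde{\boldsymbol{w}}_{i-1}$ and $\boldsymbol{e}_p(i) = \boldsymbol{u}_i \tilde{\boldsymbol{w}}_i$, so that $\tilde{\boldsymbol{w}}_i^* \boldsymbol{u}_i^* = \boldsymbol{e}_p^*(i)$ and $\tilde{\boldsymbol{w}}_{i-1}^* \boldsymbol{u}_i^* = \boldsymbol{e}_a^*(i)$.

```latex
\begin{aligned}
\Big\| \tilde{\boldsymbol{w}}_i + \frac{\boldsymbol{u}_i^*}{\|\boldsymbol{u}_i\|^2}\,\boldsymbol{e}_a(i) \Big\|^2
  &= \|\tilde{\boldsymbol{w}}_i\|^2 + \frac{|\boldsymbol{e}_a(i)|^2}{\|\boldsymbol{u}_i\|^2}
     + \frac{2}{\|\boldsymbol{u}_i\|^2}\,\mathrm{Re}\big(\boldsymbol{e}_p^*(i)\,\boldsymbol{e}_a(i)\big) \\
\Big\| \tilde{\boldsymbol{w}}_{i-1} + \frac{\boldsymbol{u}_i^*}{\|\boldsymbol{u}_i\|^2}\,\boldsymbol{e}_p(i) \Big\|^2
  &= \|\tilde{\boldsymbol{w}}_{i-1}\|^2 + \frac{|\boldsymbol{e}_p(i)|^2}{\|\boldsymbol{u}_i\|^2}
     + \frac{2}{\|\boldsymbol{u}_i\|^2}\,\mathrm{Re}\big(\boldsymbol{e}_a^*(i)\,\boldsymbol{e}_p(i)\big)
\end{aligned}
```

Since $\mathrm{Re}(\boldsymbol{e}_p^* \boldsymbol{e}_a) = \mathrm{Re}(\boldsymbol{e}_a^* \boldsymbol{e}_p)$, the two cross-terms are identical, and equating the two squared norms leaves exactly the four energy terms of (15.30).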


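The results in both cases of zero and nonzero regression vectors can be combined together
by using a common notation. Define $\bar{\mu}(i) = (\|\boldsymbol{u}_i\|^2)^{\dagger}$, in terms of the pseudo-inverse
operation. Recall that the pseudo-inverse of a nonzero scalar is equal to its inverse value,
while the pseudo-inverse of zero is equal to zero. That is,

$$ \bar{\mu}(i) \;\triangleq\; \begin{cases} 1/\|\boldsymbol{u}_i\|^2 & \text{if } \boldsymbol{u}_i \neq 0 \\ 0 & \text{otherwise} \end{cases} \tag{15.31} $$

Using $\bar{\mu}(i)$, we can combine (15.27) and (15.30) into a single identity as

$$ \|\tilde{\boldsymbol{w}}_i\|^2 + \bar{\mu}(i)\,|\boldsymbol{e}_a(i)|^2 = \|\tilde{\boldsymbol{w}}_{i-1}\|^2 + \bar{\mu}(i)\,|\boldsymbol{e}_p(i)|^2 \tag{15.32} $$

We can alternatively express (15.32) as

$$ \|\boldsymbol{u}_i\|^2 \cdot \|\tilde{\boldsymbol{w}}_i\|^2 + |\boldsymbol{e}_a(i)|^2 = \|\boldsymbol{u}_i\|^2 \cdot \|\tilde{\boldsymbol{w}}_{i-1}\|^2 + |\boldsymbol{e}_p(i)|^2 $$


Theorem 15.1 (Energy conservation relation) For adaptive filters of the form
(15.23), and for any data $\{\boldsymbol{d}(i), \boldsymbol{u}_i\}$, it always holds that

$$ \|\tilde{\boldsymbol{w}}_i\|^2 + \bar{\mu}(i)\,|\boldsymbol{e}_a(i)|^2 = \|\tilde{\boldsymbol{w}}_{i-1}\|^2 + \bar{\mu}(i)\,|\boldsymbol{e}_p(i)|^2 $$

where $\boldsymbol{e}_a(i) = \boldsymbol{u}_i \tilde{\boldsymbol{w}}_{i-1}$, $\boldsymbol{e}_p(i) = \boldsymbol{u}_i \tilde{\boldsymbol{w}}_i$, $\tilde{\boldsymbol{w}}_i = w^o - \boldsymbol{w}_i$, and $\bar{\mu}(i)$ is defined as
in (15.31).


The important fact to emphasize here is that no approximations have been used to establish
the energy relation (15.32); it is an exact relation that shows how the energies of the weight-error
vectors at two successive time instants are related to the energies of the a priori and
a posteriori estimation errors.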
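Because (15.32) is exact, it can be checked numerically on arbitrary complex data. The sketch below is a sanity check under assumptions of my own choosing (random vectors of length 3, step-size 0.01, LMS and LMF error functions); it evaluates the difference between the two sides of (15.32) for one update of the form (15.23), including the degenerate case $u_i = 0$.

```python
import random


def norm2(x):
    return sum(abs(xk) ** 2 for xk in x)


def energy_gap(w_o, w_prev, u, d, mu, g):
    """Left side minus right side of (15.32) for one update w_i = w_{i-1} + mu u_i^* g[e(i)]."""
    e = d - sum(uk * wk for uk, wk in zip(u, w_prev))
    ge = g(e)
    w_new = [wk + mu * uk.conjugate() * ge for wk, uk in zip(w_prev, u)]
    wt_prev = [a - b for a, b in zip(w_o, w_prev)]          # weight error at time i-1
    wt_new = [a - b for a, b in zip(w_o, w_new)]            # weight error at time i
    e_a = sum(uk * wk for uk, wk in zip(u, wt_prev))        # a priori estimation error
    e_p = sum(uk * wk for uk, wk in zip(u, wt_new))         # a posteriori estimation error
    nu = norm2(u)
    mu_bar = 1.0 / nu if nu > 0 else 0.0                    # pseudo-inverse of ||u_i||^2, (15.31)
    return (norm2(wt_new) + mu_bar * abs(e_a) ** 2) - (norm2(wt_prev) + mu_bar * abs(e_p) ** 2)


def rand_c():
    return complex(random.gauss(0, 1), random.gauss(0, 1))


random.seed(2)
gaps = []
for g in (lambda e: e, lambda e: e * abs(e) ** 2):          # LMS and LMF error functions
    w_o = [rand_c() for _ in range(3)]
    w_prev = [rand_c() for _ in range(3)]
    u = [rand_c() for _ in range(3)]
    gaps.append(energy_gap(w_o, w_prev, u, rand_c(), 0.01, g))   # d(i) is arbitrary
gaps.append(energy_gap([1 + 0j], [0j], [0j], 2 + 1j, 0.01, lambda e: e))   # u_i = 0
```

All entries of `gaps` are zero up to floating-point rounding, regardless of the error function and without any model for the data.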


Remark 15.1 (Interpretations of the energy relation) In App. 15.A we provide several
interpretations for the energy relation (15.32): one interpretation is geometric in terms of vector
projections, a second interpretation is physical and relates to Snell’s law for light propagation, and a
third interpretation is system-theoretic and relates to feedback concepts. These interpretations are
not needed for the subsequent derivations in this chapter. Nevertheless, they provide the reader with
some insights into the energy relation.

◊


15.4 VARIANCE RELATION 


Relation (15.32) has important ramifications in the study of adaptive filters. In this chapter, 
and the remaining chapters of this part, we shall focus on its significance to the steady-state 
performance, tracking analysis, and finite-precision analysis of adaptive filters. In Part V 
(Transient Performance) we shall apply it to transient analysis, and in Part XI (Robust 
Filters) we shall examine its significance to robustness analysis. In the course of these 
discussions, it will become clear that the energy-conservation relation (15.32) provides a 
unifying framework for the performance analysis of adaptive filters. 

With regard to steady-state performance, which is the subject matter of this chapter, it
has been common in the literature to study the steady-state performance of an adaptive filter
as the limiting behavior of its transient performance (which is concerned with the study
of the time evolution of $\mathsf{E}\,\|\tilde{\boldsymbol{w}}_i\|^2$). As we shall see in Part V (Transient Performance),
transient analysis is a more demanding task to pursue and it tends to require a handful of
additional assumptions and restrictions on the data. In this way, steady-state results that
are obtained as the limiting behavior of a transient analysis would be governed by the same
restrictions on the data. In our treatment, on the other hand, we separate the study of the
steady-state performance of an adaptive filter from the study of its transient performance.
In so doing, it becomes possible to pursue the steady-state analysis in several instances
under weaker assumptions than those required by a full-blown transient analysis.


Steady-State Filter Operation 
To initiate our steady-state performance studies, we first explain what is meant by an adap- 
tive filter operating in steady-state. 


Definition 15.1 (Steady-state filter) An adaptive filter will be said to operate
in steady-state if it holds that

$$ \mathsf{E}\,\tilde{\boldsymbol{w}}_i \;\longrightarrow\; s, \quad \text{as } i \to \infty \tag{15.33} $$

$$ \mathsf{E}\,\tilde{\boldsymbol{w}}_i \tilde{\boldsymbol{w}}_i^* \;\longrightarrow\; C, \quad \text{as } i \to \infty \tag{15.34} $$

where $s$ and $C$ are some finite constants (usually $s = 0$).


In other words, the mean and covariance matrix of the weight error vector of a steady-state
filter tend to some finite constant values $(s, C)$. In particular, it follows that the following
condition holds

$$ \mathsf{E}\,\|\tilde{\boldsymbol{w}}_i\|^2 = \mathsf{E}\,\|\tilde{\boldsymbol{w}}_{i-1}\|^2 = c, \quad \text{as } i \to \infty \tag{15.35} $$

where $c = \mathrm{Tr}(C)$. Of course, not every adaptive filter implementation can be guaranteed
to reach steady-state operation. For example, as we shall see later in Chapter 24, if the
step-size $\mu$ in an LMS implementation is not sufficiently small, the filter can diverge with
the error signals $\boldsymbol{e}_a(i)$ and $\tilde{\boldsymbol{w}}_i$ growing unbounded. We shall have much more to say in
Part V (Transient Performance) about conditions on the step-size parameter in order to
guarantee filter convergence to steady-state (i.e., stability). The main conclusion of Part
V (Transient Performance) will be that sufficiently small step-sizes in general guarantee
the convergence of adaptive filters to steady-state operation. By assuming in the current
chapter that a filter is operating in steady-state, we are in effect attempting to quantify
beforehand the performance that can be expected from the filter once it reaches steady-state.
Such qualifications are useful at the design stage.


Variance Relation for Steady-State Performance 

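In order to explain how (15.32) is useful in evaluating the steady-state performance of an
adaptive filter, we recall from (15.21) that we are interested in evaluating the steady-state
variance of $\boldsymbol{e}_a(i)$. Now taking expectations of both sides of (15.32) we get

$$ \mathsf{E}\,\|\tilde{\boldsymbol{w}}_i\|^2 + \mathsf{E}\,\bar{\mu}(i)|\boldsymbol{e}_a(i)|^2 = \mathsf{E}\,\|\tilde{\boldsymbol{w}}_{i-1}\|^2 + \mathsf{E}\,\bar{\mu}(i)|\boldsymbol{e}_p(i)|^2 \tag{15.36} $$

where the expectation is with respect to the distributions of the random variables $\{\boldsymbol{d}(i), \boldsymbol{u}_i\}$.
Taking the limit of (15.36) as $i \to \infty$ and using the steady-state condition (15.35), we obtain

$$ \mathsf{E}\,\bar{\mu}(i)|\boldsymbol{e}_a(i)|^2 = \mathsf{E}\,\bar{\mu}(i)|\boldsymbol{e}_p(i)|^2, \quad \text{as } i \to \infty \tag{15.37} $$

This equality is in terms of $\{\boldsymbol{e}_a(i), \boldsymbol{e}_p(i)\}$. However, from (15.25) we know how $\boldsymbol{e}_p(i)$ is
related to $\boldsymbol{e}_a(i)$. Substituting into (15.37) we get

$$ \mathsf{E}\,\bar{\mu}(i)|\boldsymbol{e}_a(i)|^2 = \mathsf{E}\,\bar{\mu}(i)\,\big|\boldsymbol{e}_a(i) - \mu\|\boldsymbol{u}_i\|^2 g[\boldsymbol{e}(i)]\big|^2, \quad \text{as } i \to \infty \tag{15.38} $$

Expanding the term on the right-hand side and simplifying leads to (we are omitting the
argument of $g$ for compactness of notation):

$$
\begin{aligned}
\bar{\mu}(i)\,\big|\boldsymbol{e}_a(i) - \mu\|\boldsymbol{u}_i\|^2 g\big|^2
 &= \bar{\mu}(i)|\boldsymbol{e}_a(i)|^2 + \mu^2\|\boldsymbol{u}_i\|^2 |g|^2 - \mu\,\boldsymbol{e}_a(i)\,g^* - \mu\,\boldsymbol{e}_a^*(i)\,g \\
 &= \bar{\mu}(i)|\boldsymbol{e}_a(i)|^2 + \mu^2\|\boldsymbol{u}_i\|^2 |g|^2 - 2\mu\,\mathrm{Re}\big(\boldsymbol{e}_a^*(i)\,g\big)
\end{aligned}
\tag{15.39}
$$

where in the first equality we used the fact that

$$ \bar{\mu}(i)\,\|\boldsymbol{u}_i\|^4 = \|\boldsymbol{u}_i\|^2 \quad \text{and} \quad \bar{\mu}(i)\,\|\boldsymbol{u}_i\|^2\,\boldsymbol{e}_a(i)\,g^* = \boldsymbol{e}_a(i)\,g^* $$

for all $\boldsymbol{u}_i$ (whether $\boldsymbol{u}_i = 0$ or otherwise). Taking expectation of the right-hand side of
(15.39) and substituting into (15.38) we obtain

$$ \mu\,\mathsf{E}\big(\|\boldsymbol{u}_i\|^2 \cdot |g[\boldsymbol{e}(i)]|^2\big) = 2\,\mathrm{Re}\big(\mathsf{E}\,\boldsymbol{e}_a^*(i)\,g[\boldsymbol{e}(i)]\big), \quad \text{as } i \to \infty \tag{15.40} $$

in terms of the real part of $\boldsymbol{e}_a^*(i)\,g[\boldsymbol{e}(i)]$. Alternatively, this result can be obtained by starting
directly from the weight-error vector recursion (15.24) and equating the squared Euclidean
norms of both sides, i.e.,

$$
\begin{aligned}
\|\tilde{\boldsymbol{w}}_i\|^2 &= \|\tilde{\boldsymbol{w}}_{i-1} - \mu \boldsymbol{u}_i^* g\|^2 \\
 &= \|\tilde{\boldsymbol{w}}_{i-1}\|^2 + \mu^2\|\boldsymbol{u}_i\|^2 |g|^2 - 2\mu\,\mathrm{Re}\big(\boldsymbol{e}_a^*(i)\,g\big)
\end{aligned}
$$

Taking expectations of both sides as $i \to \infty$ gives (15.40). This alternative argument masks
the presence of the useful energy conservation relation (15.32). In summary, we find that
under expectation and in steady-state, the energy relation (15.32) leads to the following
equivalent form.


Theorem 15.2 (Variance relation) For adaptive filters of the form (15.23) and
for any data $\{\boldsymbol{d}(i), \boldsymbol{u}_i\}$, assuming filter operation in steady-state, the following
relation holds:

$$ \mu\,\mathsf{E}\big(\|\boldsymbol{u}_i\|^2 \cdot |g[\boldsymbol{e}(i)]|^2\big) = 2\,\mathrm{Re}\big(\mathsf{E}\,\boldsymbol{e}_a^*(i)\,g[\boldsymbol{e}(i)]\big), \quad \text{as } i \to \infty $$

For real-valued data, this variance relation becomes

$$ \mu\,\mathsf{E}\big(\|\boldsymbol{u}_i\|^2 \cdot g^2[\boldsymbol{e}(i)]\big) = 2\,\mathsf{E}\,\boldsymbol{e}_a(i)\,g[\boldsymbol{e}(i)], \quad \text{as } i \to \infty $$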


We remark again that the variance relation (15.40) is exact, since it holds without any
approximations or assumptions (except for the assumption that the filter is operating in
steady-state, which is necessary if one is interested in evaluating the steady-state performance
of a filter). We refer to (15.40) as a variance relation since it will be our starting
point for evaluating the variance $\mathsf{E}\,|\boldsymbol{e}_a(i)|^2$ for different adaptive filters. The results of both
Thms. 15.1 and 15.2 do not require the analysis model (15.16); they hold for general data
$\{\boldsymbol{d}(i), \boldsymbol{u}_i\}$.
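The real-valued form of the variance relation can be checked empirically for LMS, where $g[e] = e$. The sketch below replaces the steady-state expectations by time averages; all numerical settings (filter length, step-size, noise level, run length) are assumptions chosen for illustration.

```python
import random

random.seed(3)
M, mu, sigma_v = 4, 0.05, 0.1
w_o = [0.5, -1.0, 0.25, 2.0]            # assumed vector w^o
w = list(w_o)                           # start at w^o; a burn-in is still discarded
N, burn_in = 30000, 5000

lhs = rhs = 0.0
for i in range(N):
    u = [random.gauss(0, 1) for _ in range(M)]
    v = random.gauss(0, sigma_v)
    e_a = sum(uk * (wo - wk) for uk, wo, wk in zip(u, w_o, w))   # a priori estimation error
    e = e_a + v                                                  # relation (15.19)
    g = e                                                        # LMS error function
    if i >= burn_in:
        lhs += mu * sum(uk * uk for uk in u) * g * g             # mu ||u_i||^2 g^2[e(i)]
        rhs += 2.0 * e_a * g                                     # 2 e_a(i) g[e(i)]
    w = [wk + mu * uk * g for wk, uk in zip(w, u)]

K = N - burn_in
lhs, rhs = lhs / K, rhs / K
```

The two averages agree closely. In fact, summing the squared-norm identity for the weight-error vector over time shows that they differ only by the change in the squared weight-error norm over the averaging window divided by $\mu K$, which is negligible once the filter has settled.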


Relevance to Mean-Square Performance Analysis 
However, when model (15.16) is assumed, then we know from (15.19) that $\boldsymbol{e}(i)$ can be
expressed in terms of $\boldsymbol{e}_a(i)$ as

$$ \boldsymbol{e}(i) = \boldsymbol{e}_a(i) + \boldsymbol{v}(i) $$

In this way, relation (15.40) becomes an identity involving $\boldsymbol{e}_a(i)$ and, in principle, it could
be solved to evaluate the EMSE of an adaptive filter, i.e., to compute $\mathsf{E}\,|\boldsymbol{e}_a(\infty)|^2$. We say
"in principle" because, although (15.40) is an exact result, different choices for the error
function $g[\cdot]$ can make the solution for $\mathsf{E}\,|\boldsymbol{e}_a(\infty)|^2$ easier for some cases than others. It is
at this stage that simplifying assumptions become necessary. We shall illustrate this point
for several adaptive filters in the sections that follow.

In order to simplify the notation, we shall employ the symbol $\zeta$ to refer to the EMSE of
an adaptive filter, i.e.,

$$ \zeta = \mathsf{E}\,|\boldsymbol{e}_a(\infty)|^2 $$

For example, the EMSE of LMS will be denoted by $\zeta^{\mathrm{LMS}}$. Its misadjustment will be denoted
by $\mathcal{M}^{\mathrm{LMS}}$. Similar notation will be used for other algorithms. In view of the analysis
model (15.16), which enables us to identify $J_{\min}$ as $\sigma_v^2$, we obtain from (15.11) that the
misadjustment of an adaptive filter is related to its EMSE via

$$ \mathcal{M} = \mathrm{EMSE}/\sigma_v^2 $$

We limit our derivations in the sequel to determining expressions for the EMSE of several
adaptive filters. Expressions for the misadjustment would follow by dividing the result by
$\sigma_v^2$.


15.A APPENDIX: ENERGY RELATION INTERPRETATIONS 


We end this chapter with three interpretations for the energy relation (15.32). The first one is
geometric and relates to the projection of vectors onto one another. The second interpretation is physical
and relates to Snell's law in optics, and the third interpretation is system-theoretic and relates to
feedback concepts. These interpretations provide useful insights into the nature and origin of the energy
relation.


Geometric Interpretation 


The energy relation (15.32) can be motivated geometrically by observing that every update of the
form (15.23) enforces a certain geometric constraint. Specifically, from the weight error recursion
(15.24) we have that

$$ \tilde{\boldsymbol{w}}_{i-1} - \tilde{\boldsymbol{w}}_i = \mu \boldsymbol{u}_i^*\, g[\boldsymbol{e}(i)] $$

which shows that the difference $\tilde{\boldsymbol{w}}_{i-1} - \tilde{\boldsymbol{w}}_i$ is parallel to the regression vector $\boldsymbol{u}_i^*$ (we ignore in
this argument the degenerate case $\boldsymbol{u}_i = 0$, for which the energy relation trivializes to $\|\tilde{\boldsymbol{w}}_i\|^2 =
\|\tilde{\boldsymbol{w}}_{i-1}\|^2$). This property is depicted in Fig. 15.1, which shows the vectors $\{\tilde{\boldsymbol{w}}_{i-1}, \tilde{\boldsymbol{w}}_i, \boldsymbol{u}_i^*\}$ drawn
from the origin of the $M$-dimensional space in which they lie. The distances from the vertices of
the error vectors $\{\tilde{\boldsymbol{w}}_i, \tilde{\boldsymbol{w}}_{i-1}\}$ to $\boldsymbol{u}_i^*$ should therefore agree. The energy-conservation relation (15.32)
is a statement of this fact, as we now elaborate.

First we need to explain the notion of projecting one vector onto another. We shall study such 
projection problems in great detail in Sec. 29.1 in the context of least-squares problems; at that 
point much of the discussion below will become self-evident. Here we follow a more elementary 
exposition. 

Let $\theta$ denote the acute angle between two real-valued column vectors $\{x, y\}$. That is, $\theta \in
[0, \pi/2]$ and its squared cosine is given by

$$ \cos^2(\theta) \;\triangleq\; \frac{|y^T x|^2}{\|x\|^2 \cdot \|y\|^2} \tag{15.41} $$

where $\|\cdot\|$ denotes the Euclidean norm of its argument. The ratio $y^T x/(\|x\| \cdot \|y\|)$ always
evaluates to a real number that lies within the interval $[-1, 1]$. This fact follows from the Cauchy-Schwarz
inequality, which states that for any two vectors $\{x, y\}$, it holds that $|y^T x|^2 \le \|y\|^2 \cdot \|x\|^2$. If we
project $x$ onto $y$ (see Fig. 15.2), then the squared norm of this projection is $\|x\|^2 \cos^2(\theta)$. Likewise,
if we project $x$ onto the orthogonal direction to $y$, then the squared norm of this projection is
$\|x\|^2 \sin^2(\theta)$ where

$$ \sin^2(\theta) = 1 - \frac{|y^T x|^2}{\|x\|^2 \cdot \|y\|^2} $$

Similar conclusions hold when the vectors $\{x, y\}$ are complex-valued. That is, if we project $x$
onto $y$ then the squared norm of this projection would still be given by $\|x\|^2 \cos^2(\theta)$ where the term
$\cos^2(\theta)$ is now defined as

$$ \cos^2(\theta) \;\triangleq\; \frac{|y^* x|^2}{\|x\|^2 \cdot \|y\|^2} \tag{15.42} $$

with $y^* x$ replacing $y^T x$. Likewise, if we project $x$ onto the orthogonal direction to $y$, then the squared
norm of the resulting projection would again be given by $\|x\|^2 \sin^2(\theta)$ where now

$$ \sin^2(\theta) = 1 - \frac{|y^* x|^2}{\|x\|^2 \cdot \|y\|^2} \tag{15.43} $$


FIGURE 15.1 Geometric interpretation of adaptive updates of the form (15.23). The difference
$\tilde{\boldsymbol{w}}_{i-1} - \tilde{\boldsymbol{w}}_i$ remains parallel to $\boldsymbol{u}_i^*$.


FIGURE 15.2 Projecting $x$ onto $y$ and onto the orthogonal direction to $y$.


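Returning to Fig. 15.1, we therefore find that the squared distance from the endpoint of $\tilde{\boldsymbol{w}}_{i-1}$ to
$\boldsymbol{u}_i^*$ is equal to $\|\tilde{\boldsymbol{w}}_{i-1}\|^2 \sin^2(\theta_{i-1})$, where (cf. (15.43)):

$$ \sin^2(\theta_{i-1}) = 1 - \frac{|\boldsymbol{u}_i \tilde{\boldsymbol{w}}_{i-1}|^2}{\|\tilde{\boldsymbol{w}}_{i-1}\|^2 \cdot \|\boldsymbol{u}_i^*\|^2} = 1 - \frac{|\boldsymbol{e}_a(i)|^2}{\|\tilde{\boldsymbol{w}}_{i-1}\|^2 \cdot \|\boldsymbol{u}_i\|^2} $$

and $\theta_{i-1}$ is the acute angle between $\tilde{\boldsymbol{w}}_{i-1}$ and $\boldsymbol{u}_i^*$. Similarly, the squared distance from the endpoint
of $\tilde{\boldsymbol{w}}_i$ to $\boldsymbol{u}_i^*$ is equal to $\|\tilde{\boldsymbol{w}}_i\|^2 \sin^2(\theta_i)$, where

$$ \sin^2(\theta_i) = 1 - \frac{|\boldsymbol{u}_i \tilde{\boldsymbol{w}}_i|^2}{\|\tilde{\boldsymbol{w}}_i\|^2 \cdot \|\boldsymbol{u}_i^*\|^2} = 1 - \frac{|\boldsymbol{e}_p(i)|^2}{\|\tilde{\boldsymbol{w}}_i\|^2 \cdot \|\boldsymbol{u}_i\|^2} $$

and $\theta_i$ is the acute angle between $\tilde{\boldsymbol{w}}_i$ and $\boldsymbol{u}_i^*$. Since the distances from the vertices of $\{\tilde{\boldsymbol{w}}_i, \tilde{\boldsymbol{w}}_{i-1}\}$
to $\boldsymbol{u}_i^*$ should agree, it must hold that

$$ \|\tilde{\boldsymbol{w}}_{i-1}\|^2 \left[ 1 - \frac{|\boldsymbol{e}_a(i)|^2}{\|\tilde{\boldsymbol{w}}_{i-1}\|^2 \cdot \|\boldsymbol{u}_i\|^2} \right] = \|\tilde{\boldsymbol{w}}_i\|^2 \left[ 1 - \frac{|\boldsymbol{e}_p(i)|^2}{\|\tilde{\boldsymbol{w}}_i\|^2 \cdot \|\boldsymbol{u}_i\|^2} \right] $$

This equality is simply the energy relation (15.32).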
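The geometric statement can also be checked directly for real-valued vectors: if two vectors differ by a multiple of $u_i^*$, their endpoints are equally far from the line along $u_i^*$. In the sketch below, the helper `dist_to_line`, the vector length, and the scalar `step` (standing in for $\mu\, g[e(i)]$) are illustrative assumptions.

```python
import math
import random


def dist_to_line(x, y):
    """Distance from the endpoint of x to the line through the origin along y (real vectors)."""
    c = sum(a * b for a, b in zip(x, y)) / sum(b * b for b in y)   # projection coefficient
    return math.sqrt(sum((a - c * b) ** 2 for a, b in zip(x, y)))


random.seed(4)
u = [random.gauss(0, 1) for _ in range(3)]
wt_prev = [random.gauss(0, 1) for _ in range(3)]
step = 0.51                                            # stands in for mu * g[e(i)], any scalar
wt_new = [a - step * b for a, b in zip(wt_prev, u)]    # (15.24): the change is parallel to u_i^*

d_prev = dist_to_line(wt_prev, u)                      # ||w~_{i-1}|| sin(theta_{i-1})
d_new = dist_to_line(wt_new, u)                        # ||w~_i|| sin(theta_i)
```

The two distances coincide up to rounding for any value of `step`, which is the content of the energy relation in geometric form.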


Physical Interpretation 


The geometric argument reveals that the distances from the vertices of $\tilde{w}_{i-1}$ and $\tilde{w}_i$ to $u_i^*$ should coincide and that, therefore,

$$\|\tilde{w}_{i-1}\|^2\sin^2(\theta_{i-1}) = \|\tilde{w}_i\|^2\sin^2(\theta_i) \qquad (15.44)$$

where the quantities $\{\sin^2(\theta_{i-1}), \sin^2(\theta_i)\}$ are defined by the ratios

$$\sin^2(\theta_{i-1}) = 1 - \frac{|e_a(i)|^2}{\|\tilde{w}_{i-1}\|^2\cdot\|u_i\|^2} \quad\text{and}\quad \sin^2(\theta_i) = 1 - \frac{|e_p(i)|^2}{\|\tilde{w}_i\|^2\cdot\|u_i\|^2} \qquad (15.45)$$

Equality (15.44) reminds us of a famous result in optics relating the refraction indices of two mediums with the sines of the incident and refracted rays of light. More specifically, consider two mediums with refraction indices $\{\eta_1, \eta_2\}$. Consider further a ray of light that impinges on the layer separating both mediums at an angle $\theta_1$ relative to the vertical direction (see Fig. 15.3). The refracted ray of light exits the layer from the other side at an angle $\theta_2$, also relative to the vertical direction. Both angles $\{\theta_1, \theta_2\}$ are restricted to the interval $[0, \pi/2]$. It is a well-known result from optics, known as Snell's law, that the quantities $\{\eta_1, \eta_2, \theta_1, \theta_2\}$ satisfy $\eta_1\sin\theta_1 = \eta_2\sin\theta_2$.

FIGURE 15.3 A ray of light impinging on the layer separating two mediums with different refraction indices.


Now observe that it follows from (15.44) that

$$\|\tilde{w}_{i-1}\|\sin(\theta_{i-1}) = \|\tilde{w}_i\|\sin(\theta_i) \qquad (15.46)$$

in terms of the acute angles $\{\theta_i, \theta_{i-1}\}$ between $\{\tilde{w}_i, \tilde{w}_{i-1}\}$ and $u_i^*$. This suggests that we can associate with the operation of an adaptive filter a fictitious ray travelling from one medium to another. The magnitudes $\|\tilde{w}_{i-1}\|$ and $\|\tilde{w}_i\|$ play the role of refraction indices of the mediums, and the layer separating both mediums is along the direction perpendicular to $u_i^*$. The angles $\{\theta_{i-1}, \theta_i\}$ play the role of the incidence and refraction angles of the ray. Of course, the dynamic nature of the adaptive filter update is such that the values of $\{\|\tilde{w}_{i-1}\|, \|\tilde{w}_i\|, \theta_{i-1}, \theta_i\}$ change with time, as well as the direction of the separation layer.

Alternatively, we can interpret the result (15.46) graphically as shown in Fig. 15.4. An incident vector of norm $\|\tilde{w}_{i-1}\|$ impinges on the separation layer at an angle $\theta_{i-1}$ with $u_i^*$, while a refracted vector of norm $\|\tilde{w}_i\|$ leaves the layer at an angle $\theta_i$ with $u_i^*$. The norms $\{\|\tilde{w}_{i-1}\|, \|\tilde{w}_i\|\}$ and the angles $\{\theta_{i-1}, \theta_i\}$ satisfy (15.46) so that the energy relation amounts to stating that the projections of these vectors along the horizontal direction have equal norms.

FIGURE 15.4 Snell's law interpretation of the energy-conservation relation (15.32). The norms of the horizontal projections should agree in view of (15.46).


System-Theoretic Interpretation 


The energy relation (15.32) also admits a system-theoretic interpretation; it allows us to represent an 
adaptive filter as the interconnection of a lossless mapping and a feedback path — see Fig. 15.5. The 
interpretation is useful in studying the robustness properties of adaptive filters; as we shall do later 
in Part XI (Robust Filters). While we do not use this interpretation in this chapter, it is worthwhile 
presenting it here. 

First we remark that whenever two vectors, say, $x$ and $y$, have identical Euclidean norms, i.e., whenever $\|x\|^2 = \|y\|^2$, there should exist a lossless or energy-preserving mapping between them. In more precise terms, whenever $\|x\|^2 = \|y\|^2$, it can be shown that this is equivalent to the existence of a unitary matrix $U$ that takes one vector to the other, say, $x = Uy$ (so that $\|x\|^2 = y^*U^*Uy = \|y\|^2$). We shall establish this result later in Sec. 33.3. The statement is obviously true for scalars since $|x|^2 = |y|^2$ implies $x = ye^{j\phi}$ for some angle $\phi \in [-\pi, \pi]$. In the vector case, the phase argument $e^{j\phi}$ is replaced by a unitary matrix. Now note that if we introduce the two column vectors

$$\mathrm{col}\left\{\tilde{w}_i,\ \sqrt{\bar{\mu}(i)}\,e_a(i)\right\} \quad\text{and}\quad \mathrm{col}\left\{\tilde{w}_{i-1},\ \sqrt{\bar{\mu}(i)}\,e_p(i)\right\}$$

then the energy relation (15.32) states that these two column vectors have identical Euclidean norms. Therefore, there should exist an energy-preserving (or lossless or unitary) mapping between them. Actually, it is not hard to determine an expression for this mapping. Using $e_a(i) = u_i\tilde{w}_{i-1}$ and (15.29) we can write

$$\begin{bmatrix} \tilde{w}_i \\ \sqrt{\bar{\mu}(i)}\,e_a(i) \end{bmatrix} = \underbrace{\begin{bmatrix} I - \bar{\mu}(i)u_i^*u_i & \sqrt{\bar{\mu}(i)}\,u_i^* \\ \sqrt{\bar{\mu}(i)}\,u_i & 0 \end{bmatrix}}_{\mathcal{T}} \begin{bmatrix} \tilde{w}_{i-1} \\ \sqrt{\bar{\mu}(i)}\,e_p(i) \end{bmatrix} \qquad (15.47)$$

where we are denoting the lossless mapping by $\mathcal{T}$ (see Prob. IV.21). Observe that $\mathcal{T}$ is solely dependent on the regression data $u_i$. Furthermore, since, as explained in Sec. 15.2, the data $\{d(i), u_i\}$ can always be assumed to satisfy item (a) of model (15.16) for some noise sequence $v(i)$, namely, $d(i) = u_iw^o + v(i)$, then we can express $e(i)$ as $e(i) = e_a(i) + v(i)$ (cf. (15.19)). In this way, the right-hand side of expression (15.25) is dependent solely on $\{e_a(i), v(i), u_i\}$ so that $e_p(i)$ can be obtained from $e_a(i)$ via a feedback connection as shown in Fig. 15.5.
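To make the lossless property concrete, the sketch below builds the mapping for a real-valued regressor and checks numerically that its Gram matrix is the identity. It assumes real data (so that conjugate transposition reduces to transposition) and a nonzero regressor with $\bar{\mu}(i) = 1/\|u_i\|^2$; the function names and the sample regressor are hypothetical.

```python
import math

def lossless_map(u):
    """Build the (M+1) x (M+1) mapping of (15.47) for a real row regressor u,
    using mu_bar = 1/||u||^2 (u is assumed nonzero)."""
    M = len(u)
    mu_bar = 1.0 / sum(x * x for x in u)
    s = math.sqrt(mu_bar)
    # Top block row: [I - mu_bar * u^T u,  sqrt(mu_bar) * u^T]
    T = [[(1.0 if r == c else 0.0) - mu_bar * u[r] * u[c] for c in range(M)]
         + [s * u[r]] for r in range(M)]
    # Bottom block row: [sqrt(mu_bar) * u,  0]
    T.append([s * x for x in u] + [0.0])
    return T

def max_deviation_from_identity(T):
    """Largest absolute entry of T^T T - I."""
    n = len(T)
    worst = 0.0
    for r in range(n):
        for c in range(n):
            g = sum(T[k][r] * T[k][c] for k in range(n))
            worst = max(worst, abs(g - (1.0 if r == c else 0.0)))
    return worst

T = lossless_map([0.5, -1.5, 2.0, 0.25])   # illustrative regressor
deviation = max_deviation_from_identity(T)
print(deviation)  # rounding-level value: the mapping preserves Euclidean norms
```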


FIGURE 15.5 Every adaptive filter of the form (15.23) can be represented as an interconnection of a lossless feedforward mapping and a feedback loop.


Performance of LMS


We now move on to illustrate the use of the variance relation (15.40) in evaluating
the steady-state performance of adaptive filters. We start with the simplest of algorithms, 
namely, LMS. 


16.1 VARIANCE RELATION 


Thus, assume that the data $\{d(i), u_i\}$ satisfy model (15.16) and consider the LMS recursion

$$w_i = w_{i-1} + \mu u_i^* e(i) \qquad (16.1)$$

for which

$$g[e(i)] = e(i) = e_a(i) + v(i) \qquad (16.2)$$

Relation (15.40) then becomes

$$\mu\,\mathrm{E}\,\|u_i\|^2|e_a(i) + v(i)|^2 = 2\,\mathrm{Re}\left(\mathrm{E}\,e_a^*(i)[e_a(i) + v(i)]\right), \quad i \to \infty \qquad (16.3)$$


Several terms in this equality get cancelled. We shall carry out the calculations rather 
slowly in this section for illustration purposes only. Later, when similar calculations are 
called upon, we shall be less detailed. 

To begin with, the expression on the left-hand side of (16.3) expands to 


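$$\begin{aligned} \mu\,\mathrm{E}\,\|u_i\|^2|e_a(i) + v(i)|^2 &= \mu\,\mathrm{E}\left[\|u_i\|^2\left(|e_a(i)|^2 + |v(i)|^2 + e_a^*(i)v(i) + e_a(i)v^*(i)\right)\right] \\ &= \mu\,\mathrm{E}\,\|u_i\|^2|e_a(i)|^2 + \mu\sigma_v^2\,\mathrm{E}\,\|u_i\|^2 \\ &= \mu\,\mathrm{E}\,\|u_i\|^2|e_a(i)|^2 + \mu\sigma_v^2\,\mathrm{Tr}(R_u) \end{aligned} \qquad (16.4)$$

where we used the fact that $v(i)$ is independent of both $u_i$ and $e_a(i)$ (recall Lemma 15.1), so that the cross-terms involving $\{v(i), e_a(i), u_i\}$ cancel out. We also used the fact that

$$\mathrm{E}\,\|u_i\|^2 = \mathrm{Tr}(R_u) \quad\text{and}\quad \mathrm{E}\,|v(i)|^2 = \sigma_v^2$$

Similarly, the expression on the right-hand side of (16.3) simplifies to $2\,\mathrm{E}\,|e_a(i)|^2$, which is simply $2\zeta^{LMS}$ as $i \to \infty$. Therefore, equality (16.3) amounts to

$$\zeta^{LMS} = \frac{\mu}{2}\left[\mathrm{E}\,\|u_i\|^2|e_a(i)|^2 + \sigma_v^2\,\mathrm{Tr}(R_u)\right], \quad\text{as } i \to \infty \qquad (16.5)$$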


This expression has been arrived at without approximations. Still, it requires that we evaluate the steady-state value of the expectation $\mathrm{E}\,\|u_i\|^2|e_a(i)|^2$ in order to arrive at the EMSE of LMS. Some assumptions will now need to be introduced in order to proceed with the analysis, even for this simplest of algorithms!

We shall examine three scenarios. One scenario assumes sufficiently small step-sizes, 
while another relies on a useful separation assumption. The third scenario assumes regres- 
sors with Gaussian distribution. 


16.2 SMALL STEP-SIZES 


Expression (16.5) suggests that small step-sizes lead to small $\mathrm{E}\,|e_a(i)|^2$ in steady-state and, consequently, to a high likelihood of small values for $e_a(i)$ itself. So assume $\mu$ is small enough so that, in steady-state, the contribution of the term $\mathrm{E}\,\|u_i\|^2|e_a(i)|^2$ can be neglected, say,

$$\mathrm{E}\,\|u_i\|^2|e_a(i)|^2 \ll \sigma_v^2\,\mathrm{Tr}(R_u)$$

Then, we find from (16.5) that the EMSE can be approximated by

$$\zeta^{LMS} = \frac{\mu\sigma_v^2\,\mathrm{Tr}(R_u)}{2} \quad \text{(for sufficiently small } \mu\text{)} \qquad (16.6)$$


16.3 SEPARATION PRINCIPLE 


If the step-size is not sufficiently small, but still small enough to guarantee filter conver- 
gence — as will be discussed in Chapter 24, we can derive an alternative approximation for 
the EMSE from (16.5); the resulting expression will hold over a wider range of step-sizes. 
To do so, here and in several other places in this chapter and in subsequent chapters, we 
shall rely on the following assumption: 


At steady-state, $\|u_i\|^2$ is independent of $e_a(i)$ (16.7)


We shall refer to this condition as the separation assumption or the separation principle. Alternatively, we could assume instead that


At steady-state, $\|u_i\|^2$ is independent of $e(i)$ (16.8)


with $e_a(i)$ replaced by $e(i)$. This condition is equivalent to (16.7) since $e(i) = e_a(i) + v(i)$ and $\|u_i\|^2$ is independent of $v(i)$ (as follows from Lemma 15.1).

Of course, assumption (16.7) is only exact in some special cases, e.g., when the successive regressors have constant Euclidean norms, since then $\|u_i\|^2$ becomes a constant; this situation occurs when the entries of $u_i$ arise from a finite alphabet with constant magnitude (see Prob. IV.2). More generally, the assumption is reasonable at steady-state since the behavior of $e_a(i)$ in the limit is likely to be less sensitive to the regression (input) data.

Assumption (16.7) allows us to separate the expectation $\mathrm{E}\,\|u_i\|^2|e_a(i)|^2$, which appears in (16.5), into the product of two expectations:

$$\mathrm{E}\left(\|u_i\|^2\cdot|e_a(i)|^2\right) = \left(\mathrm{E}\,\|u_i\|^2\right)\cdot\left(\mathrm{E}\,|e_a(i)|^2\right) = \mathrm{Tr}(R_u)\,\zeta^{LMS}, \quad i \to \infty \qquad (16.9)$$


In order to illustrate this approximation, we show in Fig. 16.1 the result of simulating a 20-tap LMS filter over 1000 experiments. The figure shows the ensemble-averaged curves that correspond to the quantities

$$\mathrm{E}\left(\|u_i\|^2\cdot|e_a(i)|^2\right) \quad\text{and}\quad \left(\mathrm{E}\,\|u_i\|^2\right)\cdot\left(\mathrm{E}\,|e_a(i)|^2\right)$$

[Plot: ensemble-average curves, product of expectations vs. expectation of product; horizontal axis: iteration, 0 to 10000.]

FIGURE 16.1 Ensemble-average curves for the expectation of the product, $\mathrm{E}\left(\|u_i\|^2\cdot|e_a(i)|^2\right)$ (upper curve), and for the product of expectations, $\left(\mathrm{E}\,\|u_i\|^2\right)\cdot\left(\mathrm{E}\,|e_a(i)|^2\right)$ (lower curve), for a 20-tap LMS filter with step-size $\mu = 0.001$. The curves are obtained by averaging over 1000 experiments.


It is seen that both curves tend to each other as time progresses so that it is reasonable to 
use the separation assumption (16.7) to approximate the expectation of the product as the 
product of expectations. 
Now substituting (16.9) into (16.5) leads to the following expression for the EMSE of LMS:

$$\zeta^{LMS} = \frac{\mu\sigma_v^2\,\mathrm{Tr}(R_u)}{2 - \mu\,\mathrm{Tr}(R_u)} \quad \text{(over a wider range of } \mu\text{)} \qquad (16.10)$$

This result will be revisited in Chapter 23; see the discussion following the statement of Thm. 23.3.


16.4 WHITE GAUSSIAN INPUT 


One particular case for which the term $\mathrm{E}\,\|u_i\|^2|e_a(i)|^2$ that appears in (16.5) can be evaluated in closed-form occurs when $u_i$ has a circular Gaussian distribution with a diagonal covariance matrix, say,

$$R_u = \sigma_u^2 I, \quad \sigma_u^2 > 0 \qquad (16.11)$$

That is, when the probability density function of $u_i$ is of the form (cf. Lemma A.1):

$$f_u(u) = \frac{1}{\pi^M\det R_u}\exp\left\{-uR_u^{-1}u^*\right\} = \frac{1}{(\pi\sigma_u^2)^M}\exp\left\{-\|u\|^2/\sigma_u^2\right\}$$

The diagonal structure of $R_u$ amounts to saying that the entries of $u_i$ are uncorrelated among themselves and that each has variance $\sigma_u^2$. The analysis can still be carried out in closed form even without this whiteness assumption; it suffices to require the regressors to be Gaussian. Moreover, $R_u$ does not need to be a scaled multiple of the identity. We treat this more general situation in Sec. 23.1; see Prob. IV.19 for further motivation and the discussion following Thm. 23.3.
In addition to (16.11), we shall assume in this subsection that


At steady state, $\tilde{w}_{i-1}$ is independent of $u_i$ (16.12)


Conditions (16.11) and (16.12) enable us to evaluate $\mathrm{E}\,\|u_i\|^2|e_a(i)|^2$ explicitly. Before doing so, however, it is worth pointing out that to perform this task, it has been common in the literature to rely not on (16.12) but instead on a set of conditions known collectively as the independence assumptions. These assumptions require the data $\{d(i), u_i\}$ to satisfy the following conditions:

(i) The sequence $\{d(i)\}$ is i.i.d.

(ii) The sequence $\{u_i\}$ is also i.i.d.

(iii) Each $u_i$ is independent of previous $\{d(j),\ j < i\}$.

(iv) Each $d(i)$ is independent of previous $\{u_j,\ j < i\}$.

(v) The $d(i)$ and $u_i$ are jointly Gaussian.

(vi) In the case of complex-valued data, the $d(i)$ and $u_i$ are individually and jointly circular random variables, i.e., they satisfy $\mathrm{E}\,u_i^Tu_i = 0$, $\mathrm{E}\,d^2(i) = 0$, and $\mathrm{E}\,u_i^Td(i) = 0$.


The independence assumptions (i)-(vi) are in general restrictive since, in practice, the sequence $\{u_i\}$ is rarely i.i.d. Consider, for example, the case in which the regressors $\{u_i\}$ correspond to state vectors of an FIR implementation, as in the channel estimation application of Sec. 10.5. In this case, two successive regressors share common entries and cannot be statistically independent. Still, when the step-size is sufficiently small, the conclusions that are obtained under the independence assumptions (i)-(vi) tend to be realistic (see App. 24.A). This may explain their widespread use in the adaptive filtering literature. While restrictive, they provide significant simplifications in the derivations and tend to lead to results that match reasonably well with practice for small step-sizes. The key question, of course, is how large the step-size can be while the conclusions of an independence analysis remain valid. There does not seem to exist a clear answer to this inquiry in the literature.

Condition (16.12) is less restrictive than the independence assumptions (i)-(vi). Actually, assumption (16.12) is implied by the independence conditions. To see this, recall from the discussion in Sec. 15.2 that $w_{i-1}$ is a function of the variables $\{w_{-1};\ d(i-1), \ldots, d(0);\ u_{i-1}, \ldots, u_0\}$. Therefore, if the sequence $u_i$ is assumed i.i.d., and if $u_i$ is independent of all previous $\{d(j)\}$ and of $w_{-1}$, then $u_i$ will be independent of $w_{i-1}$ for all $i$. Note, in addition, that condition (16.12) is only requiring the independence of $\{\tilde{w}_{i-1}, u_i\}$ to hold in steady-state, which is a considerably weaker assumption than what is implied by the full-blown independence assumptions (i)-(vi). Moreover, assumption (16.12) is reasonable for small step-sizes $\mu$. Intuitively, this is because the update term in (15.24) is relatively small for small $\mu$ and the statistical dependence of $\tilde{w}_{i-1}$ on $u_i$ becomes weak. Furthermore, in steady-state, the error $e(i)$ is also small, which makes the update term in (15.24) even smaller.


Remark 16.1 (Independence) In our development in this chapter we do not adopt the independence assumptions (i)-(vi). Instead, we rely almost exclusively on the separation condition (16.7). It is only in this subsection, for Gaussian regressors, that we also use assumption (16.12) in order to show how to re-derive some known results for Gaussian data from the variance relation (15.40).

So let us return to the term $\mathrm{E}\,\|u_i\|^2|e_a(i)|^2$ in (16.5) and show how it can be evaluated under (16.12), and under the assumption of circular Gaussian regressors. First we show how to express $\mathrm{E}\,\|u_i\|^2|e_a(i)|^2$ in terms of $\mathrm{E}\,|e_a(i)|^2$ (see (16.17) further ahead). Thus, note the following sequence of identities:

$$\begin{aligned} \mathrm{E}\,\|u_i\|^2|e_a(i)|^2 &= \mathrm{E}\,(u_iu_i^*)(u_i\tilde{w}_{i-1}\tilde{w}_{i-1}^*u_i^*) \\ &= \mathrm{E}\,\mathrm{Tr}(u_iu_i^*u_i\tilde{w}_{i-1}\tilde{w}_{i-1}^*u_i^*) \\ &= \mathrm{E}\,\mathrm{Tr}(u_i^*u_i\tilde{w}_{i-1}\tilde{w}_{i-1}^*u_i^*u_i) \\ &= \mathrm{Tr}\,\mathrm{E}\,(u_i^*u_i\tilde{w}_{i-1}\tilde{w}_{i-1}^*u_i^*u_i) \end{aligned} \qquad (16.13)$$


where in the second equality we used the fact that the trace of a scalar is equal to the 
scalar itself, and in the third equality we used the property that Tr( AB) = Tr( BA) for any 
matrices A and B of compatible dimensions. 

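We now evaluate the term $\mathrm{E}\,(u_i^*u_i\tilde{w}_{i-1}\tilde{w}_{i-1}^*u_i^*u_i)$, which is a covariance matrix. To do so, we recall the following property of conditional expectations, namely, that for any two random variables $x$ and $y$, it holds that $\mathrm{E}\,x = \mathrm{E}\,(\mathrm{E}\,[x|y])$; see (1.4). Therefore, in steady-state,

$$\begin{aligned} \mathrm{E}\,(u_i^*u_i\tilde{w}_{i-1}\tilde{w}_{i-1}^*u_i^*u_i) &= \mathrm{E}\left[\mathrm{E}\,(u_i^*u_i\tilde{w}_{i-1}\tilde{w}_{i-1}^*u_i^*u_i \mid u_i)\right] \\ &= \mathrm{E}\left[u_i^*u_i\,\mathrm{E}\,(\tilde{w}_{i-1}\tilde{w}_{i-1}^* \mid u_i)\,u_i^*u_i\right] \\ &= \mathrm{E}\,(u_i^*u_iC_{i-1}u_i^*u_i) \end{aligned} \qquad (16.14)$$

where in the last step we used assumption (16.12), namely, that $\tilde{w}_{i-1}$ and $u_i$ are independent so that

$$\mathrm{E}\,(\tilde{w}_{i-1}\tilde{w}_{i-1}^* \mid u_i) = \mathrm{E}\,\tilde{w}_{i-1}\tilde{w}_{i-1}^* = C_{i-1}$$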


We are also denoting the covariance matrix of $\tilde{w}_{i-1}$ by $C_{i-1}$. We do not need to know the value of $C_{i-1}$, as the argument will demonstrate (see the remark following (16.17)). We are then reduced to evaluating the expression $\mathrm{E}\,u_i^*u_iC_{i-1}u_i^*u_i$. Due to the circular Gaussian assumption on $u_i$, this term has the same form as the general term that we evaluated earlier in Lemma A.3 for Gaussian variables, with the identifications

$$z \leftarrow u_i^*, \quad W \leftarrow C_{i-1}, \quad \Lambda \leftarrow \sigma_u^2 I$$

so that we can use the result of that lemma to write

$$\mathrm{E}\,u_i^*u_iC_{i-1}u_i^*u_i = \sigma_u^4\left[\mathrm{Tr}(C_{i-1})I + C_{i-1}\right] \qquad (16.15)$$

Substituting this equality into (16.13) we obtain

$$\mathrm{E}\,\|u_i\|^2|e_a(i)|^2 = \mathrm{Tr}\left[\sigma_u^4\,\mathrm{Tr}(C_{i-1})I + \sigma_u^4C_{i-1}\right] = (M + 1)\sigma_u^4\,\mathrm{Tr}(C_{i-1}) \qquad (16.16)$$
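Now repeating the same argument that led to (16.14), we also find that

$$\begin{aligned} \mathrm{E}\,|e_a(i)|^2 &= \mathrm{E}\,(u_i\tilde{w}_{i-1}\tilde{w}_{i-1}^*u_i^*) \\ &= \mathrm{E}\,(u_iC_{i-1}u_i^*) \\ &= \mathrm{Tr}\,\mathrm{E}\,(u_i^*u_iC_{i-1}) \\ &= \mathrm{Tr}(R_uC_{i-1}) \\ &= \sigma_u^2\,\mathrm{Tr}(C_{i-1}) \end{aligned}$$

This expression relates $\mathrm{Tr}(C_{i-1})$ to $\mathrm{E}\,|e_a(i)|^2$. Substituting into (16.16) we obtain

$$\mathrm{E}\,\|u_i\|^2|e_a(i)|^2 = (M + 1)\sigma_u^2\,\mathrm{E}\,|e_a(i)|^2 \qquad (16.17)$$

This relation expresses the desired term $\mathrm{E}\,\|u_i\|^2|e_a(i)|^2$ as a scaled multiple of $\mathrm{E}\,|e_a(i)|^2$ alone; observe that $C_{i-1}$ is cancelled out. Using this result in (16.5), and noting that $\mathrm{Tr}(R_u) = M\sigma_u^2$, we get

$$\zeta^{LMS} = \frac{\mu M\sigma_v^2\sigma_u^2}{2 - \mu(M + 1)\sigma_u^2} \quad \text{(for complex-valued data)} \qquad (16.18)$$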


The above derivation assumes complex-valued data. If the data were real-valued, then the same arguments would still apply with the only exception of Lemma A.3. Instead, we would employ the result of Lemma A.2 and replace (16.15) by

$$\mathrm{E}\,u_i^Tu_iC_{i-1}u_i^Tu_i = \sigma_u^4\left[\mathrm{Tr}(C_{i-1})I + 2C_{i-1}\right]$$

with an additional scaling factor of 2 (now $C_{i-1} = \mathrm{E}\,\tilde{w}_{i-1}\tilde{w}_{i-1}^T$). Then (16.16) and (16.17) would become

$$\mathrm{E}\,\|u_i\|^2e_a^2(i) = (M + 2)\sigma_u^4\,\mathrm{Tr}(C_{i-1}) = (M + 2)\sigma_u^2\,\mathrm{E}\,e_a^2(i) \qquad (16.19)$$

and the resulting expression for the EMSE is

$$\zeta^{LMS} = \frac{\mu M\sigma_v^2\sigma_u^2}{2 - \mu(M + 2)\sigma_u^2} \quad \text{(for real-valued data)} \qquad (16.20)$$
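The factor $(M + 2)$ in (16.19) can be checked by a small Monte Carlo experiment. The sketch below is only illustrative: it draws real white Gaussian regressors, holds the weight-error vector fixed at an arbitrary value (which is consistent with the independence condition (16.12)), and compares sample averages of the two sides of (16.19). The function name, the sample size, and the seed are assumptions made for this example.

```python
import random

def fourth_moment_check(M=4, sigma_u2=1.0, num_samples=20000, seed=0):
    """Return sample estimates of E ||u||^2 e_a^2 and (M+2) sigma_u^2 E e_a^2
    for real white Gaussian regressors and a fixed weight-error vector."""
    rng = random.Random(seed)
    sigma_u = sigma_u2 ** 0.5
    w_err = [1.0 / (k + 1) for k in range(M)]   # arbitrary fixed stand-in
    acc_joint = 0.0   # accumulates ||u||^2 * e_a^2
    acc_ea2 = 0.0     # accumulates e_a^2
    for _ in range(num_samples):
        u = [rng.gauss(0.0, sigma_u) for _ in range(M)]
        e_a = sum(a * b for a, b in zip(u, w_err))
        norm2 = sum(a * a for a in u)
        acc_joint += norm2 * e_a * e_a
        acc_ea2 += e_a * e_a
    lhs = acc_joint / num_samples
    rhs = (M + 2) * sigma_u2 * (acc_ea2 / num_samples)
    return lhs, rhs

lhs, rhs = fourth_moment_check()
print(lhs, rhs)  # the two estimates should be close for a large sample
```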


16.5 STATEMENT OF RESULTS 


We summarize the earlier results for the LMS filter in the following statement. A conclusion that stands out from the expressions in the lemma is that the performance of LMS is dependent on the filter length $M$, the step-size $\mu$, and the input covariance matrix $R_u$.


Lemma 16.1 (EMSE of LMS) Consider the LMS recursion (16.1) and assume the data $\{d(i), u_i\}$ satisfy model (15.16). Then its EMSE can be approximated by the following expressions:

1. For sufficiently small step-sizes, it holds that $\zeta^{LMS} = \mu\sigma_v^2\,\mathrm{Tr}(R_u)/2$.

2. Under the separation assumption (16.7), it holds that $\zeta^{LMS} = \mu\sigma_v^2\,\mathrm{Tr}(R_u)/[2 - \mu\,\mathrm{Tr}(R_u)]$.

3. If $u_i$ is Gaussian with $R_u = \sigma_u^2 I$, and under the steady-state assumption (16.12), it holds that $\zeta^{LMS} = \mu M\sigma_v^2\sigma_u^2/[2 - \mu(M + \gamma)\sigma_u^2]$, where $\gamma = 2$ if the data is real-valued and $\gamma = 1$ if the data is complex-valued and $u_i$ circular. Here $M$ is the dimension of $u_i$.

In all cases, the misadjustment is obtained by dividing the EMSE by $\sigma_v^2$.


16.6 SIMULATION RESULTS


Figures 16.2-16.4 show the values of the steady-state MSE of a 10-tap LMS filter for dif- 
ferent choices of the step-size and for different signal conditions. The theoretical values are 
obtained by using the expressions from Lemma 16.1. For each step-size, the experimental 
value is obtained by running LMS for 4 x 10° iterations and averaging the squared-error 
curve ($|e(i)|^2$) over 100 experiments in order to generate the ensemble-average curve. The average of the last 5000 entries of the ensemble-average curve is then used as the experimental value for the MSE. The data $\{d(i), u_i\}$ are generated according to model (15.16) using Gaussian noise with variance $\sigma_v^2 = 0.001$.
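For reference, the three theoretical expressions of Lemma 16.1 are easy to evaluate directly. The following sketch codes them as small functions; the function names and the sample parameter values (a 10-tap filter with unit-variance white regressors and step-size 0.01) are illustrative choices, not the settings used to produce the figures.

```python
def emse_small_step(mu, sigma_v2, trace_Ru):
    """First expression of Lemma 16.1 (sufficiently small step-sizes)."""
    return mu * sigma_v2 * trace_Ru / 2.0

def emse_separation(mu, sigma_v2, trace_Ru):
    """Second expression of Lemma 16.1 (separation assumption (16.7))."""
    return mu * sigma_v2 * trace_Ru / (2.0 - mu * trace_Ru)

def emse_white_gaussian(mu, sigma_v2, sigma_u2, M, real_valued=True):
    """Third expression of Lemma 16.1 (white Gaussian regressors);
    gamma = 2 for real data and gamma = 1 for complex circular data."""
    gamma = 2 if real_valued else 1
    return mu * M * sigma_v2 * sigma_u2 / (2.0 - mu * (M + gamma) * sigma_u2)

def misadjustment(emse, sigma_v2):
    """Misadjustment: EMSE divided by the noise variance."""
    return emse / sigma_v2

# Illustrative values: M = 10, sigma_u^2 = 1 (so Tr(R_u) = 10), sigma_v^2 = 0.001.
mu, sigma_v2, sigma_u2, M = 0.01, 0.001, 1.0, 10
z_small = emse_small_step(mu, sigma_v2, M * sigma_u2)
z_sep = emse_separation(mu, sigma_v2, M * sigma_u2)
z_real = emse_white_gaussian(mu, sigma_v2, sigma_u2, M, real_valued=True)
z_cplx = emse_white_gaussian(mu, sigma_v2, sigma_u2, M, real_valued=False)
print(z_small, z_sep, z_cplx, z_real)
```

For these values the expressions differ by only a few percent, which is in line with the remark that (16.6) is adequate for small step-sizes; the gap widens as the step-size grows.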

In Fig. 16.2, the regressors $\{u_i\}$ do not have shift structure (i.e., they do not correspond to regressors that arise from a tapped-delay-line implementation). The regressors are generated as independent realizations of a Gaussian distribution with a covariance matrix $R_u$ whose eigenvalue spread is $\rho = 5$. Observe from the leftmost plot how expression (16.6) leads to a good fit between theory and practice for small step-sizes. On the other hand, as can be seen from the rightmost plot, expression (16.10) provides a better fit over a wider range of step-sizes.
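An experiment of this kind can be sketched in a few lines. The version below is a scaled-down illustration and not the setup behind Fig. 16.2: it assumes real data, white unit-variance Gaussian regressors without shift structure (so that the trace of the covariance matrix equals the filter length), a 5-tap filter, and far fewer iterations and runs. All names, sizes, and the seed are assumptions made for this example. It estimates the EMSE by averaging the squared a priori error over the tail of each run and compares it with expression (16.10).

```python
import random

def simulate_lms_emse(M=5, mu=0.01, sigma_v2=0.001, num_iter=4000,
                      num_runs=20, tail=1000, seed=1):
    """Estimate the steady-state EMSE of real-valued LMS with white
    unit-variance Gaussian regressors (illustrative settings)."""
    rng = random.Random(seed)
    sigma_v = sigma_v2 ** 0.5
    w_o = [rng.gauss(0.0, 1.0) for _ in range(M)]    # unknown model vector
    acc = 0.0
    for _ in range(num_runs):
        w = [0.0] * M
        for i in range(num_iter):
            u = [rng.gauss(0.0, 1.0) for _ in range(M)]
            # a priori error e_a(i) = u_i (w^o - w_{i-1})
            e_a = sum(a * (b - c) for a, b, c in zip(u, w_o, w))
            e = e_a + rng.gauss(0.0, sigma_v)        # e(i) = e_a(i) + v(i)
            if i >= num_iter - tail:
                acc += e_a * e_a
            w = [wk + mu * uk * e for wk, uk in zip(w, u)]   # LMS update (16.1)
    return acc / (num_runs * tail)

M, mu, sigma_v2 = 5, 0.01, 0.001
measured = simulate_lms_emse(M=M, mu=mu, sigma_v2=sigma_v2)
theory = mu * sigma_v2 * M / (2.0 - mu * M)    # expression (16.10), Tr(R_u) = M
print(measured, theory)
```

The estimate fluctuates with the seed and the number of runs; averaging over more experiments, as described above, tightens the match.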

In Fig. 16.3, the regressors $\{u_i\}$ have shift structure and they are generated by feeding correlated data $\{u(i)\}$ into a tapped delay line. The correlated data are obtained by filtering a unit-variance i.i.d. Gaussian random process $\{s(i)\}$ through a first-order auto-regressive model with transfer function

$$\frac{\sqrt{1 - a^2}}{1 - az^{-1}}$$

and $a = 0.8$. It is shown in Prob. IV.1 that the auto-correlation sequence of the resulting process $\{u(i)\}$ is

$$r(k) = \mathrm{E}\,u(i)u(i - k) = a^{|k|}$$

for all integer values $k$. In this way, the covariance matrix $R_u$ of the regressor $u_i$ is a $10 \times 10$ Toeplitz matrix with entries $\{a^{|i-j|},\ 0 \le i, j \le M - 1\}$.
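The auto-correlation claim can be verified from the impulse response of the first-order model, which is $h(n) = \sqrt{1 - a^2}\,a^n$ for $n \ge 0$: with a unit-variance white input, $r(k) = \sum_n h(n)h(n + |k|)$. The sketch below evaluates a truncated version of this sum and builds the corresponding Toeplitz covariance matrix; the function names and the truncation length are illustrative assumptions.

```python
import math

def ar1_autocorrelation(a, k, num_terms=200):
    """Truncated sum r(k) = sum_n h(n) h(n+|k|), with h(n) = sqrt(1-a^2) a^n,
    for a unit-variance white input to the first-order model."""
    k = abs(k)
    g = math.sqrt(1.0 - a * a)
    return sum((g * a ** n) * (g * a ** (n + k)) for n in range(num_terms))

def toeplitz_covariance(a, M):
    """M x M Toeplitz matrix with entries a^{|i-j|}."""
    return [[a ** abs(i - j) for j in range(M)] for i in range(M)]

a, M = 0.8, 10
r3 = ar1_autocorrelation(a, 3)
R_u = toeplitz_covariance(a, M)
trace_Ru = sum(R_u[i][i] for i in range(M))
print(r3, a ** 3, trace_Ru)
```

Since the diagonal entries are all equal to one, the trace of this matrix equals the filter length, which is the value that enters expressions (16.6) and (16.10) for this experiment.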

In Fig. 16.4, regressors with shift structure are again used but they are now generated by feeding into the tapped delay line a unit-variance white (as opposed to correlated) process so that $R_u = \sigma_u^2 I$ with $\sigma_u^2 = 1$. This situation allows us to verify the third result in Lemma 16.1. It is seen from all these simulations that the expressions of Lemma 16.1 provide reasonable approximations for the EMSE of the LMS filter. In particular, expression (16.10), which was derived under the separation assumption (16.7), provides a good match between theory and practice.

[Plots: "LMS: Gaussian regressors without shift structure" (two panels); simulation and theory curves versus step-size.]

FIGURE 16.2 Theoretical and simulated MSE for a 10-tap LMS filter with $\sigma_v^2 = 0.001$ and Gaussian regressors without shift structure. The leftmost plot compares the simulated MSE with expression (16.6), which was derived under the assumption of small step-sizes. The rightmost plot uses expression (16.10), which was derived using the separation assumption (16.7).

[Plots: "LMS: Correlated Gaussian input with shift structure" (two panels); simulation and theory curves versus step-size.]

FIGURE 16.3 Theoretical and simulated MSE for a 10-tap LMS filter with $\sigma_v^2 = 0.001$ and regressors with shift structure. The regressors are generated by feeding correlated data into a tapped delay line. The leftmost plot compares the simulated MSE with expression (16.6), which was derived under the assumption of small step-sizes. The rightmost plot uses expression (16.10), which was derived using the separation assumption (16.7).

[Plots: "LMS: White Gaussian input with shift structure" (two panels); simulation and theory curves versus step-size.]

FIGURE 16.4 Theoretical (using (16.20)) and simulated MSE for a 10-tap LMS filter with $\sigma_v^2 = 0.001$ and regressors with shift structure. The regressors are generated by feeding white Gaussian input data with unit variance into a tapped delay line.


Performance of NLMS


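We now illustrate the use of the variance relation (15.40) in evaluating the steady-state performance of the $\epsilon$-NLMS algorithm,

$$w_i = w_{i-1} + \frac{\mu}{\epsilon + \|u_i\|^2}\,u_i^* e(i) \qquad (17.1)$$

for which

$$g[e(i)] = \frac{e(i)}{\epsilon + \|u_i\|^2} = \frac{e_a(i) + v(i)}{\epsilon + \|u_i\|^2} \qquad (17.2)$$

The data $\{d(i), u_i, v(i)\}$ are assumed to satisfy model (15.16). The variance relation (15.40) in this case becomes

$$\mu\,\mathrm{E}\left(\frac{\|u_i\|^2\,|e_a(i) + v(i)|^2}{(\epsilon + \|u_i\|^2)^2}\right) = 2\,\mathrm{Re}\left\{\mathrm{E}\left(e_a^*(i)\,\frac{e_a(i) + v(i)}{\epsilon + \|u_i\|^2}\right)\right\}, \quad i \to \infty \qquad (17.3)$$

Again, several terms in this equality get cancelled. Since the arguments are similar to what we did for the LMS case in the previous chapter, we shall be brief and only highlight the main steps.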


17.1 SEPARATION PRINCIPLE 
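

Note first that by expanding both sides of (17.3) we get

$$\mu\,\mathrm{E}\left(\frac{\|u_i\|^2\,|e_a(i)|^2}{(\epsilon + \|u_i\|^2)^2}\right) + \mu\sigma_v^2\,\mathrm{E}\left(\frac{\|u_i\|^2}{(\epsilon + \|u_i\|^2)^2}\right) = 2\,\mathrm{E}\left(\frac{|e_a(i)|^2}{\epsilon + \|u_i\|^2}\right) \qquad (17.4)$$

In order to simplify this equation, we resort to the same separation principle (16.7) that we used in the LMS case, namely, that at steady-state, $\|u_i\|^2$ is independent of $|e_a(i)|^2$. Under this condition, equality (17.4) becomes

$$\mu\,\mathrm{E}\left(\frac{\|u_i\|^2}{(\epsilon + \|u_i\|^2)^2}\right)\mathrm{E}\,|e_a(i)|^2 + \mu\sigma_v^2\,\mathrm{E}\left(\frac{\|u_i\|^2}{(\epsilon + \|u_i\|^2)^2}\right) = 2\,\mathrm{E}\left(\frac{1}{\epsilon + \|u_i\|^2}\right)\mathrm{E}\,|e_a(i)|^2, \quad i \to \infty$$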

If we define the quantities (which are solely dependent on the statistics of the regression data):

$$\alpha_u \triangleq \mathrm{E}\left(\frac{\|u_i\|^2}{(\epsilon + \|u_i\|^2)^2}\right), \qquad \eta_u \triangleq \mathrm{E}\left(\frac{1}{\epsilon + \|u_i\|^2}\right) \qquad (17.5)$$

then the above equality can be written more compactly as

$$(2\eta_u - \mu\alpha_u)\,\mathrm{E}\,|e_a(i)|^2 = \mu\sigma_v^2\alpha_u, \quad i \to \infty$$

so that

$$\zeta^{\epsilon\text{-NLMS}} = \frac{\mu\sigma_v^2\alpha_u}{2\eta_u - \mu\alpha_u} \qquad (17.6)$$
When the regularization parameter $\epsilon$ is sufficiently small, which is usually the case, then its effect can be ignored and the definitions of $\alpha_u$ and $\eta_u$ coincide, namely, $\alpha_u = \eta_u = \mathrm{E}\,(1/\|u_i\|^2)$. In this case, expression (17.6) reduces to

$$\zeta^{\epsilon\text{-NLMS}} = \frac{\mu\sigma_v^2}{2 - \mu} \quad \text{(when } \epsilon \text{ is small)} \qquad (17.7)$$

which is independent of the regression data.

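An alternative expression for the EMSE can be obtained by using the assumption $\epsilon \approx 0$ in order to initially simplify (17.4) into

$$\mu\,\mathrm{E}\left(\frac{|e_a(i)|^2}{\|u_i\|^2}\right) + \mu\sigma_v^2\,\mathrm{E}\left(\frac{1}{\|u_i\|^2}\right) = 2\,\mathrm{E}\left(\frac{|e_a(i)|^2}{\|u_i\|^2}\right)$$

Then we appeal to the following steady-state approximation, instead of the separation assumption (16.7),

$$\mathrm{E}\left(\frac{|e_a(i)|^2}{\|u_i\|^2}\right) \approx \frac{\mathrm{E}\,|e_a(i)|^2}{\mathrm{E}\,\|u_i\|^2} = \frac{\mathrm{E}\,|e_a(i)|^2}{\mathrm{Tr}(R_u)}, \quad\text{as } i \to \infty \qquad (17.8)$$

and use it to find that the filter EMSE can also be approximated by

$$\zeta^{\epsilon\text{-NLMS}} = \frac{\mu\sigma_v^2}{2 - \mu}\,\mathrm{Tr}(R_u)\,\mathrm{E}\left(\frac{1}{\|u_i\|^2}\right) \quad \text{(when } \epsilon \text{ is small)} \qquad (17.9)$$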


Lemma 17.1 (EMSE of $\epsilon$-NLMS) Consider the $\epsilon$-NLMS recursion (17.1) and assume the data $\{d(i), u_i\}$ satisfy model (15.16). Then, under the separation assumption (16.7), its EMSE can be approximated by

$$\zeta^{\epsilon\text{-NLMS}} = \frac{\mu\sigma_v^2}{2 - \mu} \quad \text{(first approximation)}$$

or, under the steady-state approximation (17.8),

$$\zeta^{\epsilon\text{-NLMS}} = \frac{\mu\sigma_v^2}{2 - \mu}\,\mathrm{Tr}(R_u)\,\mathrm{E}\left(\frac{1}{\|u_i\|^2}\right) \quad \text{(second approximation)}$$

The misadjustment is obtained by dividing the EMSE by $\sigma_v^2$.
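The two approximations of the lemma can be evaluated with a short sketch such as the one below. The function names are hypothetical. The second function takes the expectation of the inverse squared regressor norm as an input, since its value depends on the regressor distribution. In the usage example, this expectation is set to $1/(M - 2)$, which is the value for real white Gaussian regressors with unit variance and $M > 2$ (a standard property of the chi-square distribution that is assumed here for illustration and is not derived in the text).

```python
def emse_nlms_first(mu, sigma_v2):
    """First approximation of Lemma 17.1 (small epsilon)."""
    return mu * sigma_v2 / (2.0 - mu)

def emse_nlms_second(mu, sigma_v2, trace_Ru, mean_inv_norm2):
    """Second approximation of Lemma 17.1 (small epsilon), where
    mean_inv_norm2 stands for E(1/||u_i||^2)."""
    return mu * sigma_v2 / (2.0 - mu) * trace_Ru * mean_inv_norm2

# Illustrative values: real white Gaussian regressors, M = 10, unit variance.
mu, sigma_v2, M = 0.5, 0.001, 10
z_first = emse_nlms_first(mu, sigma_v2)
z_second = emse_nlms_second(mu, sigma_v2, trace_Ru=float(M),
                            mean_inv_norm2=1.0 / (M - 2))
print(z_first, z_second)
```

For this example the second approximation exceeds the first by the factor $M/(M - 2) = 1.25$, so the two expressions approach each other as the filter length grows.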


The expressions for $\zeta^{\epsilon\text{-NLMS}}$ in the above statement reveal why the performance of $\epsilon$-NLMS is less sensitive to the statistics of the regression data than LMS. Observe, for example, that the expression $\zeta^{\epsilon\text{-NLMS}} = \mu\sigma_v^2/(2 - \mu)$ is independent of $R_u$.


17.2 SIMULATION RESULTS


Figures 17.1 and 17.2 show the values of the steady-state MSE of a 10-tap $\epsilon$-NLMS filter
for different choices of the step-size with e = 1079. The theoretical values are obtained 
by using the expressions of Lemma 17.1. For each step-size, the experimental value is 
obtained by running e~NLMS for 1.5 x 10? iterations and averaging the squared-error 
curve over 100 experiments in order to generate the ensemble-average curve. The average 
of the last 5000 entries of the ensemble-average curve is used as the experimental value for 
the MSE. The data $\{d(i), u_i\}$ are generated according to model (15.16) using Gaussian noise with variance $\sigma_v^2 = 0.001$.
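A scaled-down version of such an experiment can be sketched as follows. This is not the setup behind the figures: it assumes real data, white unit-variance Gaussian regressors without shift structure, a regularization value of `1e-6` chosen for this example, and far fewer iterations and runs; all names and the seed are hypothetical. The measured EMSE is compared with the two approximations of Lemma 17.1, using $\mathrm{E}\,(1/\|u_i\|^2) = 1/(M - 2)$ for these regressors (a standard chi-square property, assumed here).

```python
import random

def simulate_nlms_emse(M=10, mu=0.5, eps=1e-6, sigma_v2=0.001,
                       num_iter=2000, num_runs=20, tail=1000, seed=2):
    """Estimate the steady-state EMSE of real-valued epsilon-NLMS with
    white unit-variance Gaussian regressors (illustrative settings)."""
    rng = random.Random(seed)
    sigma_v = sigma_v2 ** 0.5
    w_o = [rng.gauss(0.0, 1.0) for _ in range(M)]    # unknown model vector
    acc = 0.0
    for _ in range(num_runs):
        w = [0.0] * M
        for i in range(num_iter):
            u = [rng.gauss(0.0, 1.0) for _ in range(M)]
            e_a = sum(a * (b - c) for a, b, c in zip(u, w_o, w))
            e = e_a + rng.gauss(0.0, sigma_v)        # e(i) = e_a(i) + v(i)
            if i >= num_iter - tail:
                acc += e_a * e_a
            gain = mu * e / (eps + sum(a * a for a in u))
            w = [wk + gain * uk for wk, uk in zip(w, u)]   # update (17.1)
    return acc / (num_runs * tail)

M, mu, sigma_v2 = 10, 0.5, 0.001
measured = simulate_nlms_emse(M=M, mu=mu, sigma_v2=sigma_v2)
first = mu * sigma_v2 / (2.0 - mu)          # expression (17.7)
second = first * M / (M - 2)                # expression (17.9) for these regressors
print(measured, first, second)
```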

In Fig. 17.1, the regressors $\{u_i\}$ do not have shift structure and they are generated as independent realizations of a Gaussian distribution with a covariance matrix $R_u$ whose eigenvalue spread is $\rho = 5$. The range of step-sizes is between 0.01 and 1. Observe from the rightmost plot how expression (17.9) leads, in this case, to a good fit between theory and practice over a wider range of step-sizes, while expression (17.7) provides a good fit for smaller step-sizes as seen from the leftmost plot.

In Fig. 17.2, the regressors $\{u_i\}$ have shift structure and are generated by feeding correlated data $\{u(i)\}$ into a tapped delay line. The correlated data are obtained by passing a unit-variance i.i.d. Gaussian random process $\{s(i)\}$ through a first-order auto-regressive model with transfer function $\sqrt{1 - a^2}/(1 - az^{-1})$ and $a = 0.8$. In this simulation, it is seen that expression (17.7) results in a better fit between theory and practice when compared with expression (17.9).


[Plots: "$\epsilon$-NLMS: Gaussian regressors without shift structure" (two panels).]

FIGURE 17.1 Theoretical and simulated MSE for a 10-tap $\epsilon$-NLMS filter with $\sigma_v^2 = 0.001$ and Gaussian regressors without shift structure. The leftmost plot compares the simulated MSE with the value that results from (17.7), while the rightmost plot uses expression (17.9).

[Plots: "$\epsilon$-NLMS: Correlated Gaussian input with shift structure" (two panels).]

FIGURE 17.2 Theoretical and simulated MSE for a 10-tap $\epsilon$-NLMS filter with $\sigma_v^2 = 0.001$ and Gaussian regressors with shift structure. The leftmost plot compares the simulated MSE with the value that results from (17.7), while the rightmost plot uses expression (17.9).


17.A APPENDIX: RELATING NLMS TO LMS


In addition to the treatment given in the body of this chapter, there are alternative ways to study the performance of $\epsilon$-NLMS. One such way is to reduce the $\epsilon$-NLMS recursion (17.1) to an LMS recursion via a suitable change of variables. Thus, introduce the transformed variables:

$$\breve{d}(i) \triangleq \frac{d(i)}{\sqrt{\epsilon + \|u_i\|^2}}, \qquad \breve{u}_i \triangleq \frac{u_i}{\sqrt{\epsilon + \|u_i\|^2}}, \qquad \breve{v}(i) \triangleq \frac{v(i)}{\sqrt{\epsilon + \|u_i\|^2}} \qquad (17.10)$$

Then the $\epsilon$-NLMS recursion (17.1) becomes

$$w_i = w_{i-1} + \mu\breve{u}_i^*\breve{e}(i), \quad\text{where}\quad \breve{e}(i) = \breve{d}(i) - \breve{u}_iw_{i-1}$$


which is simply an LMS recursion in the variables $\{\breve{d}(i), \breve{u}_i\}$. Moreover, given model (15.16) for $\{d(i), u_i\}$, it follows that the variables $\{\breve{d}(i), \breve{u}_i\}$ satisfy a similar model with the same $w^o$. More specifically,

(a) There exists a vector $w^o$ such that $\breve{d}(i) = \breve{u}_iw^o + \breve{v}(i)$.

(b) The noise sequence $\{\breve{v}(i)\}$ is orthogonal with $\breve{\sigma}_v^2 = \mathrm{E}\,|\breve{v}(i)|^2 = \sigma_v^2\,\mathrm{E}\,(1/(\epsilon + \|u_i\|^2))$.

(c) The noise sequence $\breve{v}(i)$ is orthogonal to $\breve{u}_j$ for all $i \ne j$.

(d) The initial condition $w_{-1}$ is independent of all $\{\breve{d}(j), \breve{u}_j, \breve{v}(j)\}$.

(e) The regressors have $\breve{R}_u = \mathrm{E}\,\breve{u}_i^*\breve{u}_i = \mathrm{E}\,(u_i^*u_i/(\epsilon + \|u_i\|^2)) > 0$.

(f) The random variables $\{\breve{v}(i), \breve{u}_i\}$ are not necessarily zero mean (see Prob. IV.20).
(17.11)


The main differences in relation to model (15.16) are conditions (b), (c), and (f). This is because the sequence $\{\breve{v}(i)\}$ is not i.i.d. any longer but only orthogonal. In other words, it now satisfies $\mathrm{E}\,\breve{v}(i)\breve{v}^*(j) = 0$ for all $i \ne j$. Moreover, $\breve{v}(i)$ is not independent of $\breve{u}_i$ since its definition involves $u_i$. Instead, $\breve{v}(i)$ is orthogonal to $\breve{u}_j$, i.e., it satisfies $\mathrm{E}\,\breve{v}(i)\breve{u}_j = 0$ for all $i, j$. Still, the result of Lemma 15.1 will hold with the qualification "independent of" replaced by "orthogonal to", and with the variables $\{v(i), e_a(j)\}$ replaced by $\{\breve{v}(i), \breve{e}_a(j)\}$, where $\breve{e}_a(j) = \breve{u}_j\tilde{w}_{j-1}$. Under these conditions, the derivation that led to (16.5) could be repeated, and it would lead to a similar relation of the form

$$\breve{\zeta} = \frac{\mu}{2}\left[\mathrm{E}\,\|\breve{u}_i\|^2|\breve{e}_a(i)|^2 + \breve{\sigma}_v^2\,\mathrm{Tr}(\breve{R}_u)\right], \quad\text{as } i \to \infty$$
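This line of reasoning, however, would only allow us to evaluate the transformed variance $\mathrm{E}\,|\breve{e}_a(\infty)|^2$, as opposed to the desired variance $\mathrm{E}\,|e_a(\infty)|^2$. But it follows from the definitions of $e_a(i)$ and $\breve{e}_a(i)$ that

$$\frac{1}{\epsilon + \|u_i\|^2}\,|e_a(i)|^2 = |\breve{e}_a(i)|^2$$

so that, by appealing again to the same kind of approximation (17.8), we can relate $\mathrm{E}\,|e_a(\infty)|^2$ to $\mathrm{E}\,|\breve{e}_a(\infty)|^2$ as follows:

$$\mathrm{E}\,|e_a(i)|^2 \approx (\epsilon + \mathrm{Tr}(R_u))\cdot\mathrm{E}\,|\breve{e}_a(i)|^2, \quad i \to \infty \qquad (17.12)$$

Therefore, using the result of Lemma 16.1 for sufficiently small step-sizes we get

$$\lim_{i\to\infty}\mathrm{E}\,|\breve{e}_a(i)|^2 = \frac{\mu\breve{\sigma}_v^2}{2}\,\mathrm{Tr}(\breve{R}_u) = \frac{\mu\breve{\sigma}_v^2}{2}\,\mathrm{Tr}\left(\mathrm{E}\left(\frac{u_i^*u_i}{\epsilon + \|u_i\|^2}\right)\right)$$

and using (17.12), we obtain

$$\zeta^{\epsilon\text{-NLMS}} = \frac{\mu\breve{\sigma}_v^2\,(\epsilon + \mathrm{Tr}(R_u))}{2}\,\mathrm{Tr}\left(\mathrm{E}\left(\frac{u_i^*u_i}{\epsilon + \|u_i\|^2}\right)\right)$$

However, when $\epsilon \approx 0$, $\mathrm{Tr}\left(\mathrm{E}\,(u_i^*u_i/(\epsilon + \|u_i\|^2))\right) = 1$, in which case the expression for the EMSE becomes

$$\zeta^{\epsilon\text{-NLMS}} = \frac{\mu\sigma_v^2}{2}\,\mathrm{Tr}(R_u)\,\mathrm{E}\left(\frac{1}{\|u_i\|^2}\right) \qquad (17.13)$$

This result agrees with the second expression in Lemma 17.1 for small $\mu$. We shall revisit the result in Probs. V.18 and V.20 without the above approximations for the expectations, but with an independence condition on the regressors (namely, condition (22.23)).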


CHAPTER 18


Performance of Sign-Error LMS


We now illustrate the use of the variance relation (15.40) in evaluating the steady-state performance of the sign-error LMS algorithm,

$$w_i = w_{i-1} + \mu u_i^*\,\mathrm{csgn}[e(i)] \qquad (18.1)$$

for which

$$g[e(i)] = \mathrm{csgn}[e(i)] = \mathrm{csgn}[e_a(i) + v(i)] \qquad (18.2)$$

where the data $\{d(i), u_i, v(i)\}$ are assumed to satisfy model (15.16). To proceed, we need to distinguish between real-valued data and complex-valued data. This is because the definition of the sign function is different in both cases. The final expressions for the EMSE, however, will turn out to be identical except for a scaling factor.


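18.1 REAL-VALUED DATA


In this case the sign function is defined as

$$\mathrm{csgn}[x] = \mathrm{sign}[x] \triangleq \begin{cases} 1 & x > 0 \\ -1 & x < 0 \\ 0 & x = 0 \end{cases}$$

Using the fact that $g^2(x) = 1$ almost everywhere on the real line, the variance relation (15.40) becomes

$$\mu\,\mathrm{Tr}(R_u) = 2\,\mathrm{E}\,e_a(i)\,\mathrm{sign}[e(i)] = 2\,\mathrm{E}\,e_a(i)\,\mathrm{sign}[e_a(i) + v(i)], \quad i \to \infty \qquad (18.3)$$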


In order to arrive at the value of the EMSE we need to evaluate the expectation on the 
right-hand side. For this purpose, we shall rely on the following assumption: 


The estimation error e(i) and the noise v(i) are jointly Gaussian (18.4) 


This condition is violated in general. For example, even if we assume that $v(i)$ and $u_i$ in
model (15.16) are Gaussian, so that $d(i)$ is Gaussian as well, this assumption alone does
not imply that the estimation error

$$e(i) = d(i) - u_i w_{i-1}$$

will be Gaussian. This is because $e(i)$ depends nonlinearly on the data $\{u_j, v(j)\}$ through
its dependence on $w_{i-1}$. However, $e(i)$ would be Gaussian conditioned on $w_{i-1}$. In other
words, if we assume $w_{i-1}$ is constant, then $e(i)$ will be Gaussian. This observation suggests
that when the step-size is sufficiently small, so that the weight estimator $w_{i-1}$ varies
slowly with time, assumption (18.4) could be reasonable. This argument also suggests that
we could replace (18.4) by the following:


$u_i$ and $v(i)$ are Gaussian and $\mu$ is sufficiently small (18.5)


To illustrate the Gaussianity assumption on e(i), we plot in Fig. 18.1 a histogram of the 
distribution of e(i) at the time instant i = 99000. The histogram is obtained by simulat- 
ing a 20-tap sign-error LMS filter over 400 experiments and using y = 5 x 1079. The 
input to the tapped-delay-line filter is correlated data obtained by passing a unit-variance 
i.i.d. Gaussian random process through a first-order auto-regressive model with transfer 
function given by 


$$\frac{\sqrt{1 - a^2}}{1 - a z^{-1}}, \qquad a = 0.8$$

The histogram suggests that the Gaussianity assumption on the distribution of $e(i)$ is reasonable.

The motivation for introducing assumption (18.4) is that it allows us to express the
expectation

$$E\, e_a(i)\, \mathrm{sign}[e_a(i) + v(i)]$$

in terms of

$$E\, e_a(i)\, [e_a(i) + v(i)]$$

with the sign function removed. This is achieved by resorting to a special case of a result
known as Price's theorem (see Prob. IV.10). The special case we are interested in states


[Figure 18.1 plot: "Distribution of the realizations of the error signal at iteration 99000 over 400 experiments". Horizontal axis: values assumed by e(i) at iteration 99000. Vertical axis: number of occurrences.]


FIGURE 18.1 Histogram of the distribution of e(i) at time instant i = 99000. The graph is 
obtained from 400 runs of a 20-tap sign-error LMS filter with step-size y = 5 x 107° and correlated 
input data. 


that for any two real-valued, zero-mean, and jointly Gaussian random variables $a$ and $b$, it
holds that

$$E\, a\, \mathrm{sign}(b) = \sqrt{\frac{2}{\pi}}\, \frac{1}{\sigma_b}\, E\, ab, \quad \text{where } \sigma_b^2 = E\, b^2 \qquad (18.6)$$
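
The special case (18.6) of Price's theorem can be checked numerically. The sketch below is a Monte Carlo illustration only: the correlation `rho`, the standard deviation `sigma_b`, the sample size, and the way the correlated pair is generated are all assumptions made here, not values from the text.

```python
import math
import random

rng = random.Random(1)
n = 200000
rho = 0.6        # assumed correlation coefficient between a and b
sigma_b = 2.0    # assumed standard deviation of b

sum_a_sign_b = 0.0
sum_ab = 0.0
for _ in range(n):
    g1, g2 = rng.gauss(0, 1), rng.gauss(0, 1)
    b = sigma_b * g1
    a = rho * g1 + math.sqrt(1 - rho ** 2) * g2   # unit-variance a, jointly Gaussian with b
    sum_a_sign_b += a * math.copysign(1.0, b)
    sum_ab += a * b

lhs = sum_a_sign_b / n                                 # sample estimate of E a sign(b)
rhs = math.sqrt(2 / math.pi) * (sum_ab / n) / sigma_b  # right-hand side of (18.6)
```

The two sample estimates should agree up to Monte Carlo error, and both should be close to $\rho\sqrt{2/\pi}$ for this particular construction.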


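Now, in view of assumption (18.4), we have that $e_a(i)$ and $e(i)$ are jointly Gaussian. Using
Price's theorem with the identifications

$$a \leftarrow e_a(i), \quad b \leftarrow e(i), \quad \sigma_b^2 \leftarrow E\, e_a^2(i) + \sigma_v^2, \quad E\, ab \leftarrow E\, e_a^2(i)$$

and substituting into (18.3), we get

$$\mu\, \mathrm{Tr}(R_u) = 2\sqrt{\frac{2}{\pi}}\, \frac{E\, e_a^2(i)}{\sqrt{E\, e_a^2(i) + \sigma_v^2}}, \quad i \to \infty \qquad (18.7)$$

Solving for

$$\zeta^{\text{sign-error LMS}} \triangleq E\, e_a^2(\infty)$$

we find that

$$\zeta^{\text{sign-error LMS}} = \frac{\alpha}{2}\left(\alpha + \sqrt{\alpha^2 + 4\sigma_v^2}\right), \quad \text{where } \alpha = \sqrt{\frac{\pi}{8}}\, \mu\, \mathrm{Tr}(R_u) \qquad (18.8)$$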
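
For real-valued data, the closed form (18.8) can be evaluated directly and checked against the relation (18.7) it was solved from. The function name `emse_sign_error_lms` and the numerical values below are illustrative assumptions.

```python
import math

def emse_sign_error_lms(mu, trace_Ru, sigma_v2):
    """EMSE of sign-error LMS for real-valued data, following (18.8)."""
    alpha = math.sqrt(math.pi / 8) * mu * trace_Ru
    return 0.5 * alpha * (alpha + math.sqrt(alpha ** 2 + 4 * sigma_v2))

mu, trace_Ru, sigma_v2 = 1e-3, 5.0, 0.001     # illustrative values
zeta = emse_sign_error_lms(mu, trace_Ru, sigma_v2)

# Residual of (18.7): mu Tr(R_u) - 2 sqrt(2/pi) zeta / sqrt(zeta + sigma_v^2)
residual = mu * trace_Ru - 2 * math.sqrt(2 / math.pi) * zeta / math.sqrt(zeta + sigma_v2)
```

A residual near zero confirms that the value returned by the closed form satisfies (18.7) with $E\, e_a^2(\infty)$ replaced by `zeta`.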


18.2 COMPLEX-VALUED DATA 


When the data are complex-valued, the definition of the csgn function is

$$\mathrm{csgn}[x] = \mathrm{sign}(x_r) + j\, \mathrm{sign}(x_i)$$

where $x = x_r + j x_i$ (see Prob. III.15). Using the fact that now $|g(x)|^2 = 2$ almost
everywhere in the complex plane, the variance relation (15.40) becomes

$$2\mu\, \mathrm{Tr}(R_u) = 2\mathrm{Re}\left(E\, e_a^*(i)\, \mathrm{csgn}[e_a(i) + v(i)]\right), \quad i \to \infty \qquad (18.9)$$

Similarly to assumption (18.4) in the real case, we also need to introduce certain Gaussian
assumptions on $\{e(i), v(i)\}$:⁶


1. The real parts of e(i) and v(i) are jointly Gaussian. 

2. The imaginary parts of e(i) and v(i) are jointly Gaussian. 

3. The real and imaginary parts of e(i) have identical variances. 

4. The real parts of (e(i), v(i)) are independent of their imaginary parts. 


Then if we employ the extension of Price's theorem to complex data (as given by Prob. IV.11),
we can rewrite (18.9) as follows:

$$2\mu\, \mathrm{Tr}(R_u) = 2\mathrm{Re}\left(E\, e_a^*(i)\, \mathrm{csgn}[e_a(i) + v(i)]\right) = 2\sqrt{\frac{2}{\pi}}\, \frac{\sqrt{2}\, E\,|e_a(i)|^2}{\sqrt{E\,|e_a(i)|^2 + \sigma_v^2}} \qquad (18.10)$$

which has a form similar to (18.7). Hence, we are led to the following conclusion.

⁶ For item (3), recall that if a complex-valued random variable $x$ is circular, then its real and imaginary parts have
equal variances, as can be verified from the defining property $E\, x^2 = 0$.


Lemma 18.1 (EMSE of sign-error LMS) Consider the sign-error LMS recursion
(18.1) and assume the data $\{d(i), u_i\}$ satisfy model (15.16). Assume
further that $\{v(i), u_i\}$ are Gaussian and that the step-size is sufficiently small,
as indicated by (18.5). Then its EMSE can be approximated by

$$\zeta^{\text{sign-error LMS}} = \frac{\alpha}{2}\left(\alpha + \sqrt{\alpha^2 + 4\sigma_v^2}\right), \quad \text{where } \alpha = \sqrt{\frac{\pi\gamma}{8}}\, \mu\, \mathrm{Tr}(R_u)$$

with $\gamma = 1$ for real-valued data and $\gamma = 2$ for complex-valued data. The
misadjustment is obtained by dividing the EMSE by $\sigma_v^2$.


18.3 SIMULATION RESULTS 


Figures 18.2 and 18.3 show the values of the steady-state MSE of a 5-tap sign-error LMS 
filter for different choices of the step-size and for different signal conditions. The theo- 
retical values are obtained by using the expression from Lemma 18.1. For each step-size, 
the experimental value is obtained by running the algorithm for 5 x 10? iterations and 
averaging the squared-error curve over 100 experiments in order to generate the ensemble- 
average curve. The average of the last 10000 entries of the ensemble-average curve is used 
as the experimental value for the MSE. The data $\{d(i), u_i\}$ are generated according to
model (15.16) using Gaussian noise with variance $\sigma_v^2 = 0.001$.

Figure 18.2 illustrates the situation where the regressors do not have shift structure
and are generated from a Gaussian distribution with a covariance matrix $R_u$ whose
eigenvalue spread is $\rho = 5$. In Fig. 18.3, the regressors have shift structure. For the top
plot, they are obtained by feeding the filter with correlated data generated by passing a
unit-variance i.i.d. Gaussian random process through a first-order auto-regressive model
with transfer function $\sqrt{1 - a^2}/(1 - a z^{-1})$ and $a = 0.8$. For the bottom plot of Fig. 18.3,
the input data is white with unit variance.


FIGURE 18.2 Theoretical and simulated MSE for a 5-tap sign-error LMS filter with $\sigma_v^2 = 0.001$
and Gaussian regressors without shift structure. [Plot: "Sign-error LMS: Gaussian regressors without shift structure"; curves: simulation and theory.]


FIGURE 18.3 Theoretical and simulated MSE for a 5-tap sign-error LMS filter with $\sigma_v^2 = 0.001$
and Gaussian regressors with shift structure. [Plots: "Sign-error LMS: Correlated Gaussian input with shift structure" (top) and "Sign-error LMS: White Gaussian input with shift structure" (bottom); horizontal axis: step-size; curves: simulation and theory.]
CHAPTER 19


Performance of RLS and Other Filters


We now examine the performance of the RLS algorithm and comment on the performance
of several other adaptive filters.


19.1 PERFORMANCE OF RLS 


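We consider the RLS algorithm of Sec. 14.1, namely,

$$P_i = \lambda^{-1}\left[P_{i-1} - \frac{\lambda^{-1} P_{i-1} u_i^* u_i P_{i-1}}{1 + \lambda^{-1} u_i P_{i-1} u_i^*}\right] \qquad (19.1)$$

$$w_i = w_{i-1} + P_i u_i^* [d(i) - u_i w_{i-1}], \quad i \ge 0 \qquad (19.2)$$

with initial condition $P_{-1} = \epsilon^{-1} I$ and $0 \ll \lambda \le 1$. The scalar $\epsilon$ is a small positive number
and $\lambda$ is usually close to one. We are using a boldface letter $P_i$ to indicate that $P_i$ is a
random variable due to its dependence on the regressors $\{u_i\}$. Also, recall from (14.3)
and (14.5) that

$$P_i^{-1} = \lambda^{i+1} \epsilon I + \sum_{j=0}^{i} \lambda^{i-j} u_j^* u_j \qquad (19.3)$$

which shows that $P_i > 0$ for all finite $i$.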
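
The recursions (19.1)-(19.2) can be sketched in a few lines for real-valued data. This is an illustrative sketch only: the function name `rls`, the zero initial weight vector, the default values of `lam` and `eps`, and the noise-free test data are assumptions made here. The code also relies on $P_i$ being symmetric, which holds for real data when $P_{-1} = \epsilon^{-1} I$.

```python
import random

def rls(regressors, desired, lam=0.995, eps=1e-3):
    """Exponentially weighted RLS for real-valued data, following (19.1)-(19.2)."""
    M = len(regressors[0])
    w = [0.0] * M                                    # w_{-1} = 0 (assumed)
    P = [[(1.0 / eps if r == c else 0.0) for c in range(M)] for r in range(M)]  # P_{-1} = I/eps
    for u, d in zip(regressors, desired):
        Pu = [sum(P[r][c] * u[c] for c in range(M)) for r in range(M)]   # P_{i-1} u_i^T
        denom = 1.0 + sum(u[r] * Pu[r] for r in range(M)) / lam          # 1 + u P u^T / lam
        # (19.1): P_i = (1/lam) [P_{i-1} - (1/lam) P_{i-1} u^T u P_{i-1} / denom]
        P = [[(P[r][c] - Pu[r] * Pu[c] / (lam * denom)) / lam for c in range(M)]
             for r in range(M)]
        e = d - sum(uk * wk for uk, wk in zip(u, w))                     # a priori error
        gain = [sum(P[r][c] * u[c] for c in range(M)) for r in range(M)] # P_i u_i^T
        w = [wk + gk * e for wk, gk in zip(w, gain)]                     # (19.2)
    return w, P

# Illustrative noise-free data generated from an assumed weight vector.
rng = random.Random(4)
w_true = [0.5, -0.3, 0.2]
U = [[rng.gauss(0, 1) for _ in w_true] for _ in range(300)]
D = [sum(uk * wk for uk, wk in zip(u, w_true)) for u in U]
w_hat, P_final = rls(U, D)
```

With noise-free data the estimate approaches the generating vector quickly, and the diagonal of the final $P_i$ stays positive, in line with the remark that $P_i > 0$.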

Compared with the update recursion (15.23), which was seen to be useful in studying
the performance of several adaptive filters in the previous sections, we now find that the
RLS update differs in a special way: it includes the matrix factor $P_i$, which multiplies
$u_i^*$ from the left in (19.2). Still, the energy-conservation approach of Sec. 15.3 can
be extended to treat this case, and even some other more general cases (see, e.g., Prob. IV.9
as well as Part V (Transient Performance) and the problems therein). Since the arguments
in the sequel are similar to what we encountered in Sec. 15.3 while deriving the energy
relation (15.32), we shall be brief and only highlight the ideas that are particular to RLS.


Energy-Conservation Relation 


To begin with, the update recursion (19.2) can be rewritten in terms of the weight-error
vector $\tilde{w}_i = w^o - w_i$ as

$$\tilde{w}_i = \tilde{w}_{i-1} - P_i u_i^* e(i) \qquad (19.4)$$

If we multiply both sides of this recursion by $u_i$ from the left we find that the a priori and
a posteriori estimation errors are related via:

$$e_p(i) = e_a(i) - \|u_i\|_{P_i}^2\, e(i) \qquad (19.5)$$

where

$$e_a(i) = u_i \tilde{w}_{i-1}, \qquad e_p(i) = u_i \tilde{w}_i \qquad (19.6)$$

and where the notation $\|\cdot\|_W^2$ stands for the squared weighted Euclidean norm of a vector,
namely, for a column vector $x$, $\|x\|_W^2 = x^* W x$ (so that, for the row vector $u_i$, $\|u_i\|_{P_i}^2 = u_i P_i u_i^*$).

To extend the energy relation to this case, we combine (19.4)-(19.5) and proceed exactly
as in Sec. 15.3. For example, when $u_i \neq 0$, we use (19.5) to eliminate $e(i)$ from (19.4).
This calculation leads to

$$\tilde{w}_i + \frac{P_i u_i^*}{\|u_i\|_{P_i}^2}\, e_a(i) = \tilde{w}_{i-1} + \frac{P_i u_i^*}{\|u_i\|_{P_i}^2}\, e_p(i)$$

By equating the squared weighted norms on both sides of this equality, using $P_i^{-1}$ as a
weighting matrix, we arrive at

$$\|\tilde{w}_i\|_{P_i^{-1}}^2 + \bar{\mu}(i)\, |e_a(i)|^2 = \|\tilde{w}_{i-1}\|_{P_i^{-1}}^2 + \bar{\mu}(i)\, |e_p(i)|^2 \qquad (19.7)$$

where

$$\bar{\mu}(i) = \begin{cases} 1/\|u_i\|_{P_i}^2 & \text{if } u_i \neq 0 \\ 0 & \text{otherwise} \end{cases} \qquad (19.8)$$


The equality (19.7) extends the energy-conservation relation (15.32) to the RLS context.
Observe that the main distinction is the appearance of the weighting matrices $\{P_i^{-1}, P_i\}$:
the former is used as a weighting factor for the weight-error vectors while the latter is used
as a weighting factor for the regressor.
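
The energy-conservation relation (19.7) is an algebraic identity, so it can be verified numerically for a single step. The sketch below assumes real-valued data and, purely for ease of inversion, a diagonal positive-definite $P_i$; all variable names and numerical values are illustrative.

```python
import random

rng = random.Random(2)
M = 4
p_diag = [rng.uniform(0.5, 2.0) for _ in range(M)]   # diagonal of P_i > 0 (assumed diagonal)
u = [rng.gauss(0, 1) for _ in range(M)]              # regressor u_i (row vector)
w_prev = [rng.gauss(0, 1) for _ in range(M)]         # weight error at time i-1
e = rng.gauss(0, 1)                                  # the identity holds for any value of e(i)

# (19.4): weight-error update, w~_i = w~_{i-1} - P_i u_i^T e(i)
w_next = [wp - p * uk * e for wp, p, uk in zip(w_prev, p_diag, u)]

e_a = sum(uk * wk for uk, wk in zip(u, w_prev))      # a priori error u_i w~_{i-1}
e_p = sum(uk * wk for uk, wk in zip(u, w_next))      # a posteriori error u_i w~_i
u_norm_P = sum(p * uk * uk for p, uk in zip(p_diag, u))          # ||u_i||^2 weighted by P_i
norm_next = sum(wk * wk / p for wk, p in zip(w_next, p_diag))    # ||w~_i||^2 weighted by P_i^{-1}
norm_prev = sum(wk * wk / p for wk, p in zip(w_prev, p_diag))    # ||w~_{i-1}||^2 weighted by P_i^{-1}

lhs = norm_next + e_a ** 2 / u_norm_P    # left-hand side of (19.7)
rhs = norm_prev + e_p ** 2 / u_norm_P    # right-hand side of (19.7)
```

The two sides agree to floating-point precision, and the errors also satisfy (19.5), regardless of how $e(i)$ is chosen.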

We are interested in evaluating the performance of RLS in steady-state. To do so, we
shall call upon the steady-state condition (cf. Def. 15.1),

$$E\, \tilde{w}_i \tilde{w}_i^* = E\, \tilde{w}_{i-1} \tilde{w}_{i-1}^* \triangleq C, \quad \text{as } i \to \infty \qquad (19.9)$$

in order to transform (19.7) into a variance relation (as was done in Sec. 15.4). However,
the presence of the matrices $\{P_i, P_i^{-1}\}$ in (19.7) makes this step challenging; this
is because the matrices $\{P_i, P_i^{-1}\}$ are dependent not only on $u_i$ but also on all prior
regressors $\{u_j, j \le i\}$. For this reason, in order to make the performance analysis of RLS
more tractable, whenever necessary, we shall approximate and replace the random variables
$\{P_i^{-1}, P_i\}$ in steady-state by their respective mean values. In a sense, this approximation
amounts to an ergodicity assumption on the regressors. For an ergodic random
process, the time-average coincides with the ensemble average. What this means in the
context of RLS is the following. Observe from expression (19.3) that, as $i \to \infty$, $P_i^{-1}$ can
be regarded as a weighted sum (or a time-average) of infinitely many terms of the form
$u_j^* u_j$. Assuming these terms are realizations of a process $u$ that is ergodic in its second
moment, then we can approximate $P_i^{-1}$ by its mean value.


Steady-State Approximation 
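To begin with, note from (19.3) that

$$P_i^{-1} = \lambda^{i+1} \epsilon I + \left[u_i^* u_i + \lambda u_{i-1}^* u_{i-1} + \ldots + \lambda^{i-1} u_1^* u_1 + \lambda^{i} u_0^* u_0\right] \qquad (19.10)$$

so that, as $i \to \infty$, and since $\lambda < 1$, the steady-state mean value of $P_i^{-1}$ is given by

$$\lim_{i \to \infty} E\, P_i^{-1} = \frac{R_u}{1 - \lambda} \qquad (19.11)$$

We denote the result by $P^{-1}$. The mean value of $P_i$, on the other hand, is considerably
harder to evaluate. So we shall satisfy ourselves with the approximation

$$E\, P_i \approx \left[E\, P_i^{-1}\right]^{-1} = (1 - \lambda) R_u^{-1} = P, \quad \text{as } i \to \infty \qquad (19.12)$$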


This is an approximation, of course, because even though $P_i$ and $P_i^{-1}$ are the inverses of
one another, it does not hold that their expected values will have the same inverse relation.
Consider, for example, a random variable $x$ that is uniformly distributed inside the interval
$[1, 2]$. Its mean is $3/2$. On the other hand, the random variable $y = 1/x$ has mean value
equal to $\ln 2$, which differs from $2/3$. Still, approximation (19.12) is generally reasonable for Gaussian regressors.
In order to illustrate this fact, we plot in Fig. 19.1 a curve showing how the matrices
$E\, P_i$ and $(1 - \lambda) R_u^{-1}$ compare for different forgetting factors. The curve in the figure
is generated by running an RLS filter of order 5 over 2000 iterations and averaging the
results over 50 experiments. The regressors are chosen as independent realizations of a
Gaussian distribution with a covariance matrix $R_u$ whose eigenvalue spread is $\rho = 5$. The
steady-state value of $E\, P_i$ is estimated by averaging the last 200 values of $P_i$ in each run
over all experiments. The curve plots the following relative measure of closeness between
$E\, P_i$ and $(1 - \lambda) R_u^{-1}$:

$$\frac{\left\|E\, P_i - (1 - \lambda) R_u^{-1}\right\|}{\left\|(1 - \lambda) R_u^{-1}\right\|}$$

in terms of the ratio of the norm of the difference to the norm of $(1 - \lambda) R_u^{-1}$. The matrix
norm used here is the maximum singular value of the matrix (see Sec. B.6); other matrix
norms can be used as well. It is seen from the figure that there is a good match between
$E\, P_i$ and $(1 - \lambda) R_u^{-1}$, especially for forgetting factors that are close to one.
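
The scalar example with a uniform random variable on $[1, 2]$ is easy to reproduce by sampling. The sketch below is illustrative only; the sample size and the seed are arbitrary choices.

```python
import math
import random

rng = random.Random(3)
n = 200000
samples = [rng.uniform(1.0, 2.0) for _ in range(n)]

mean_x = sum(samples) / n                        # sample mean of x, close to 3/2
mean_inv_x = sum(1.0 / s for s in samples) / n   # sample mean of 1/x, close to ln 2
inv_mean_x = 1.0 / mean_x                        # inverse of the mean, close to 2/3
```

The gap between `mean_inv_x` and `inv_mean_x` is the scalar analogue of the gap between $E\, P_i$ and $[E\, P_i^{-1}]^{-1}$ that approximation (19.12) ignores.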

For example, for $\lambda = 0.995$, the following steady-state value for $E\, P_i$ was estimated in
this simulation by means of ensemble averaging:


5.1176 x x x x 
x 1.8147 x x x 
EP; = 107. x x 48365 x x 
x x x 1.1968 x 
x x x x 1.0239 


with relatively small off-diagonal elements, while the value of (1 — A) RJ! was 
(1- R4! = 107?. diag{5.0, 1.8, 4.7, 1.2, 1.0} 


FIGURE 19.1 A plot of the relative difference between $E\, P_i$ and $(1 - \lambda) R_u^{-1}$ for different values
of $\lambda$ and for Gaussian regressors. [Plot: relative difference (%) versus forgetting factor $\lambda$ between 0.98 and 0.995; the smaller the value, the better the approximation.]


Filter Performance 


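Returning to the energy-conservation relation (19.7), replacing $P_i^{-1}$ by its assumed mean
value $P^{-1}$, we find that $E\, \|\tilde{w}_i\|_{P_i^{-1}}^2$ in steady-state evaluates to

$$E\, \|\tilde{w}_i\|_{P_i^{-1}}^2 \approx E\, \|\tilde{w}_i\|_{P^{-1}}^2 = \mathrm{Tr}(C P^{-1})$$

Likewise, in view of the steady-state assumption (19.9),

$$E\, \|\tilde{w}_{i-1}\|_{P_i^{-1}}^2 \approx E\, \|\tilde{w}_{i-1}\|_{P^{-1}}^2 = \mathrm{Tr}(C P^{-1})$$

In other words, we find that in steady-state

$$E\, \|\tilde{w}_i\|_{P_i^{-1}}^2 = E\, \|\tilde{w}_{i-1}\|_{P_i^{-1}}^2 \qquad (19.13)$$

This steady-state condition is the extension of (15.35) to the RLS context.
Now taking expectations of both sides of (19.7), and using (19.13), we arrive at

$$E\, \bar{\mu}(i)\, |e_a(i)|^2 = E\, \bar{\mu}(i)\, |e_p(i)|^2, \quad \text{as } i \to \infty \qquad (19.14)$$

However, from (19.5) we know how $e_p(i)$ is related to $e_a(i)$. Substituting into (19.14) we
get

$$E\, \bar{\mu}(i)\, |e_a(i)|^2 = E\, \bar{\mu}(i) \left|e_a(i) - \|u_i\|_{P_i}^2\, e(i)\right|^2, \quad \text{as } i \to \infty \qquad (19.15)$$

which upon expansion and simplification, reduces to

$$E\, \|u_i\|_{P_i}^2\, |e(i)|^2 = 2\mathrm{Re}\left(E\, e_a^*(i) e(i)\right), \quad \text{as } i \to \infty \qquad (19.16)$$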


This variance relation is the extension of (15.40) to the RLS case. Now, since the data
$\{d(i), u_i\}$ satisfy conditions (15.16), we also have that $e(i) = e_a(i) + v(i)$, and (19.16)
becomes

$$\sigma_v^2\, E\, \|u_i\|_{P_i}^2 + E\, \|u_i\|_{P_i}^2\, |e_a(i)|^2 = 2E\, |e_a(i)|^2 = 2\zeta^{\text{RLS}}, \quad i \to \infty \qquad (19.17)$$

To proceed we resort to a separation condition similar to (16.7), namely, we assume that


At steady-state, $\|u_i\|_{P_i}^2$ is independent of $e_a(i)$ (19.18)


This condition allows us to separate the expectation $E\, \|u_i\|_{P_i}^2\, |e_a(i)|^2$ into the product of
two expectations, i.e.,

$$E\left(\|u_i\|_{P_i}^2 \cdot |e_a(i)|^2\right) = \left(E\, \|u_i\|_{P_i}^2\right) \cdot \left(E\, |e_a(i)|^2\right)$$

If we now replace $P_i$ by its assumed mean value, we obtain the approximation

$$E\, \|u_i\|_{P_i}^2 \approx E\, \|u_i\|_{P}^2 = \mathrm{Tr}(R_u P) = (1 - \lambda) M \qquad (19.19)$$

Substituting into (19.17) we get

$$\zeta^{\text{RLS}} = \frac{\sigma_v^2 (1 - \lambda) M}{2 - (1 - \lambda) M} \qquad (19.20)$$
Usually, the value of $\lambda$ is very close to one, so that

$$\zeta^{\text{RLS}} \approx \frac{\sigma_v^2 (1 - \lambda) M}{2} \qquad (19.21)$$


Lemma 19.1 (EMSE of RLS) Consider the RLS recursion (19.1)-(19.2) and
assume the data $\{d(i), u_i\}$ satisfy model (15.16). Introduce further the
approximations (19.12) and (19.19). Then, under the separation condition
(19.18), the EMSE of RLS can be approximated by

$$\zeta^{\text{RLS}} = \frac{\sigma_v^2 (1 - \lambda) M}{2 - (1 - \lambda) M} \quad \text{and} \quad \zeta^{\text{RLS}} \approx \frac{\sigma_v^2 (1 - \lambda) M}{2}$$

The misadjustment is obtained by dividing the EMSE by $\sigma_v^2$.


A conclusion that stands out from the expression in the lemma is that the performance
of RLS is independent of the input covariance matrix $R_u$. [The mean-square performance
of RLS is further studied in Probs. IV.15 and V.36.]
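
The two expressions in Lemma 19.1 are simple to evaluate. The sketch below uses the filter length, forgetting factor, and noise variance quoted in this chapter's examples only as illustrative inputs; the function name `emse_rls` is an assumption made here.

```python
def emse_rls(lam, M, sigma_v2):
    """Return the EMSE from (19.20) and its approximation (19.21) for lam close to one."""
    x = (1.0 - lam) * M
    return sigma_v2 * x / (2.0 - x), sigma_v2 * x / 2.0

lam, M, sigma_v2 = 0.995, 5, 0.001
zeta, zeta_approx = emse_rls(lam, M, sigma_v2)

# Residual of (19.17) after applying (19.18)-(19.19):
# sigma_v^2 (1-lam) M + (1-lam) M zeta - 2 zeta should vanish.
residual = sigma_v2 * (1 - lam) * M + (1 - lam) * M * zeta - 2 * zeta
```

Note that the covariance matrix $R_u$ does not appear among the arguments, which mirrors the observation that the steady-state performance of RLS does not depend on it under these approximations.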


Simulation Results 
Figure 19.2 shows the values of the steady-state MSE of a 5-tap RLS filter for different
choices of the forgetting parameter $\lambda$. The theoretical values are obtained by using the
expressions from Lemma 19.1. The experimental value is obtained by running the algorithm
for 2000 iterations and averaging the squared-error curve over 300 experiments in
order to generate the ensemble-average curve. The average of the last 1000 entries of the
ensemble-average curve is used as the experimental value for the MSE. The data $\{d(i), u_i\}$
are generated according to model (15.16) using Gaussian noise with variance $\sigma_v^2 = 0.001$.
Two situations are illustrated in the figure. For the top plot, the regressors do not have
shift structure and they are generated as independent realizations of a Gaussian distribution
with a covariance matrix $R_u$ whose eigenvalue spread is $\rho = 5$. For the bottom plot, the
regressors have shift structure and they are obtained by feeding the filter with a correlated
input process that is obtained by passing a unit-variance i.i.d. Gaussian random process
through a first-order auto-regressive model with transfer function $\sqrt{1 - a^2}/(1 - a z^{-1})$ and
$a = 0.8$.


19.2 PERFORMANCE OF OTHER FILTERS 


The same line of reasoning used so far in Part IV (Mean-Square Performance) is used in 
the problems, and also in Part V (Transient Performance), to study the mean-square per- 
formance of several other adaptive filters. For ease of reference, we list here the locations 
in the text that study some of these filters. 

FIGURE 19.2 Theoretical and simulated MSE for a 5-tap RLS filter with $\sigma_v^2 = 0.001$ and
Gaussian regressors with and without shift structure.


1. Leaky-LMS. The leaky-LMS filter is described by the recursion (see Alg. 12.2):

$$w_i = (1 - \mu\alpha) w_{i-1} + \mu u_i^* [d(i) - u_i w_{i-1}], \quad i \ge 0$$

which looks similar to the LMS recursion (16.1) except for the presence of the factor
$\alpha$. The mean-square performance of this algorithm is treated in Probs. V.30 and V.32,
with Prob. V.30 providing an expression for the EMSE of the filter when $R_u = \sigma_u^2 I$
and Prob. V.32 providing an expression for the EMSE of the filter when the
regressors are circular Gaussian. In addition, Prob. V.29 studies the mean-square
performance of a more general class of leaky filters.


2. ε-NLMS with power normalization. The algorithm is described by the recursions:

$$p(i) = \beta p(i-1) + (1 - \beta) |u(i)|^2, \quad p(-1) = 0 \qquad (19.22)$$

$$w_i = w_{i-1} + \frac{\mu}{\epsilon + p(i)}\, u_i^* e(i) \qquad (19.23)$$

where $\beta$ is a positive number in the interval $0 < \beta < 1$. Compared with the update
equation for ε-NLMS in (17.1), we see that the squared-norm of the regressor,
$\|u_i\|^2$, is replaced by the power estimate $p(i)$. The mean-square performance of the
algorithm is treated in Prob. IV.5 for Gaussian regressors with shift structure.


3. LMF and LMMN. The LMMN algorithm is described by (cf. Alg. 12.4):

$$w_i = w_{i-1} + \mu u_i^* e(i) \left[\delta + (1 - \delta) |e(i)|^2\right], \quad 0 \le \delta \le 1 \qquad (19.24)$$

which employs a linear combination of $e(i)$ and $e(i)|e(i)|^2$. The least-mean
fourth (LMF) algorithm corresponds to the special case $\delta = 0$,

$$w_i = w_{i-1} + \mu u_i^* e(i) |e(i)|^2$$

The mean-square performance of these algorithms is treated in Prob. IV.6.
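4. ε-APA. The algorithm is described by the recursions

$$U_i = \begin{bmatrix} u_i \\ u_{i-1} \\ \vdots \\ u_{i-K+1} \end{bmatrix}, \qquad d_i = \begin{bmatrix} d(i) \\ d(i-1) \\ \vdots \\ d(i-K+1) \end{bmatrix} \qquad (19.25)$$

$$e_i = d_i - U_i w_{i-1} \qquad (19.26)$$

$$w_i = w_{i-1} + \mu U_i^* \left(\epsilon I + U_i U_i^*\right)^{-1} e_i, \quad i \ge 0 \qquad (19.27)$$

where $\epsilon$ is a small positive number and $K \le M$. The energy conservation approach
of Sec. 15.3 is extended to this case and the mean-square performance of the
algorithm is studied in Prob. IV.7.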


5. Sign-regressor LMS. The sign-regressor LMS filter is described by the recursion

$$w_i = w_{i-1} + \mu \cdot \mathrm{csgn}[u_i^*] \cdot [d(i) - u_i w_{i-1}], \quad i \ge 0$$

Compared with the sign-error LMS recursion (18.1), we see that the sign function is
now applied to the regressor as opposed to the error signal, $e(i) = d(i) - u_i w_{i-1}$.
The mean-square performance of this algorithm is studied in Prob. V.25.


6. CMA. The CMA2-2 algorithm is described by the recursion

$$w_i = w_{i-1} + \mu u_i^* z(i) \left[\gamma - |z(i)|^2\right], \quad z(i) = u_i w_{i-1}, \quad i \ge 0$$

where $\gamma$ is a positive scalar (see Alg. 12.6). This algorithm can be viewed as a special
case of the more general recursion

$$w_i = w_{i-1} + \mu u_i^* g[z(i)]$$

where $g[z(i)]$ is an arbitrary function of the filter output, $z(i)$. In Chapters 15-18,
we focused on adaptive filters of the form (15.23), which used $g[e(i)]$ instead of
$g[z(i)]$. Of course, $g[e(i)]$ is a function of $z(i)$ since $e(i) = d(i) - z(i)$. The
above formulation with $g[z(i)]$ is more general and it accommodates other filters,
such as CMA2-2. The derivations in Secs. 15.3 and 15.4 still apply to this broader
formulation with $g[z(i)]$.


The main issue that arises while studying CMA2-2, in contrast to the adaptive filters
already studied in this part, is the absence of a reference sequence $d(i)$ and,
correspondingly, of an explicit weight vector $w^o$ relative to which we may define the
weight-error vector, $\tilde{w}_i = w^o - w_i$. This issue is addressed in Probs. IV.16-IV.17,
where expressions for the mean-square performance of CMA2-2 are derived for both
real- and complex-valued data. Problem IV.18 extends the results to CMA1-2, which
is described by the following recursion (see Alg. 12.5):

$$w_i = w_{i-1} + \mu u_i^* \left[\gamma \frac{z(i)}{|z(i)|} - z(i)\right], \quad z(i) = u_i w_{i-1}, \quad i \ge 0$$
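
Several of the filters listed above differ only in the scalar nonlinearity that multiplies the regressor in the update. The sketch below collects a few of them for real-valued data. It is meant as an illustration under stated assumptions: the function names (`g_lms`, `g_sign_error`, `g_lmf`, `g_lmmn`, `g_cma22`, `update`) are hypothetical, and the generic update covers only recursions of the form $w_i = w_{i-1} + \mu u_i^* g[\cdot]$.

```python
def g_lms(e):
    """LMS: the error itself."""
    return e

def g_sign_error(e):
    """Sign-error LMS for real data: sign of the error."""
    return (e > 0) - (e < 0)

def g_lmf(e):
    """LMF: e |e|^2."""
    return e * abs(e) ** 2

def g_lmmn(e, delta):
    """LMMN: e [delta + (1 - delta) |e|^2], with 0 <= delta <= 1."""
    return e * (delta + (1 - delta) * abs(e) ** 2)

def g_cma22(z, gamma):
    """CMA2-2: acts on the filter output z rather than on the error."""
    return z * (gamma - abs(z) ** 2)

def update(w, u, g_value, mu):
    """One step of w_i = w_{i-1} + mu * u_i^T * g for real-valued data."""
    return [wk + mu * uk * g_value for wk, uk in zip(w, u)]

w1 = update([0.0, 0.0], [1.0, 2.0], g_lms(0.5), 0.1)
```

Setting `delta = 0` in `g_lmmn` recovers `g_lmf`, and `delta = 1` recovers `g_lms`, which matches the description of LMMN as a combination of the two.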


19.3 PERFORMANCE TABLE FOR SMALL STEP-SIZES 


We find it useful to list in Table 19.1 the EMSE of several adaptive filters under the assump- 
tion of sufficiently small step-sizes. The expressions in the table are obtained as approxi- 
mations from the results derived in this chapter and they serve as convenient references for 
comparing algorithms. The table also lists points of reference for each algorithm. 


TABLE 19.1 Approximate expressions for the excess mean-square performance of several
adaptive filters for sufficiently small step-sizes.

Algorithm EMSE Reference 


LMS          $\mu\sigma_v^2\, \mathrm{Tr}(R_u)/2$          Lemma 16.1


ε-NLMS          $\frac{\mu\sigma_v^2}{2}\, \mathrm{Tr}(R_u)\, E\left(\frac{1}{\|u_i\|^2}\right)$          Lemma 17.1


u(1 + 8)Mc 


e-NLMS with 231-8) - uM » B) 


Problem IV.5 


power normalization 


sign-error LMS E (a+ Vo? 4-402), o — T uTr(R,) | Lemma 18.1 


LMF u£S Tr(R.)/202 Problem IV.6 

LMMN pa Tr(R,)/2b | Problem IV.6 
2 

leaky-LMS Hee TR? (Ry + oT)7!] Problem V.32 

n 2 8 

sign-regressor LMS uc; Mf zer uM Problem V.25 
2 

e-APA 7v Tr(R,)E | ie ; ) Problem IV.7 

2—Hu Ui 
RLS          $\sigma_v^2 (1 - \lambda) M/2$          Lemma 19.1


2) 0)2 4 6 
CMA2-2 " EO oue tn Tr(Ru) Problem IV.17 
aE Ole) c nnl v 


CMAI-2 B (Y - Els? - 2yE|s|) Tr(Ru)/2 Problem IV.18 
A fundamental feature of adaptive filters is their ability to track variations in the
underlying signal statistics. This is because by relying on instantaneous data, the statistical
properties of the weight vector and error signals are able to react to changes in the input
signal properties. The purpose of this chapter, and the next one, is to characterize the
tracking ability of adaptive filters for nonstationary environments.

We shall continue to rely on the energy-conservation framework of Chapter 15 and use 
it to derive expressions for the excess mean-square error of an adaptive filter when the input 
signal properties vary with time. The presentation will reveal that there are actually minor 
differences between mean-square analysis and tracking analysis and that, in particular, 
tracking results can be obtained almost by inspection from the mean-square results of the 
prior chapters. 


20.1 MOTIVATION 


In order to motivate our setup for tracking analysis, we start by reviewing the basic linear
least-mean-squares estimation problem of Sec. 15.1. Thus, let $d$ and $u$ be zero-mean
random variables with second-order moments

$$E\, |d|^2 = \sigma_d^2, \quad E\, d u^* = R_{du}, \quad E\, u^* u = R_u > 0$$

The coefficient vector $w^o$ that estimates $d$ from $u$ optimally in the linear least-mean-squares
sense, i.e., the vector that solves

$$\min_{w}\; E\, |d - u w|^2 \qquad (20.1)$$

is given by $w^o = R_u^{-1} R_{du}$. The corresponding minimum cost is

$$J_{\min} = \sigma_d^2 - R_{ud} R_u^{-1} R_{du} = E\, |d - u w^o|^2 \qquad (20.2)$$
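
A small numerical instance may help fix the notation in (20.1)-(20.2). The moments below are made-up real-valued numbers chosen only so that the arithmetic is easy to follow; they are not taken from the text.

```python
# Illustrative real-valued moments (assumed for this example).
sigma_d2 = 2.0
R_u = [[1.0, 0.5],
       [0.5, 1.0]]
R_du = [0.9, 0.3]            # entries of the column vector E d u^T

# Invert the 2x2 matrix R_u explicitly.
det = R_u[0][0] * R_u[1][1] - R_u[0][1] * R_u[1][0]
R_u_inv = [[R_u[1][1] / det, -R_u[0][1] / det],
           [-R_u[1][0] / det, R_u[0][0] / det]]

# w^o = R_u^{-1} R_du
w_o = [sum(R_u_inv[r][c] * R_du[c] for c in range(2)) for r in range(2)]

# J_min = sigma_d^2 - R_ud R_u^{-1} R_du, with R_ud = R_du^T for real data
J_min = sigma_d2 - sum(R_du[r] * w_o[r] for r in range(2))

def cost(w):
    """E|d - u w|^2 = sigma_d^2 - 2 R_ud w + w^T R_u w for real-valued data."""
    quad = sum(w[r] * R_u[r][c] * w[c] for r in range(2) for c in range(2))
    return sigma_d2 - 2 * sum(R_du[r] * w[r] for r in range(2)) + quad
```

Evaluating `cost` at `w_o` returns `J_min`, and any other choice of `w` gives a larger value, which is the sense in which $w^o$ is optimal.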


In Chapter 10 we developed stochastic-gradient algorithms (i.e., adaptive filters) for
approximating $w^o$. These algorithms rely on data $\{d(i), u_i\}$ with moments $\{\sigma_d^2, R_{du}, R_u\}$
and use update equations of the form, say, for the case of an LMS implementation,

$$w_i = w_{i-1} + \mu u_i^* e(i), \quad w_{-1} = \text{initial condition}$$

or some other update form. In Chapter 15 we evaluated the performance of such filters by
measuring the excess mean-square error that is left in steady-state, namely, by computing
the difference

$$\text{EMSE} = \lim_{i \to \infty} E\, |e(i)|^2 - J_{\min} \qquad (20.3)$$

where

$$e(i) = d(i) - u_i w_{i-1} \qquad (20.4)$$

is the estimation error that results from the adaptive implementation.
Now if the moments $\{\sigma_d^2, R_{du}, R_u\}$ vary with time, say, if

$$E\, |d(i)|^2 = \sigma_{d,i}^2, \quad E\, d(i) u_i^* = R_{du,i}, \quad E\, u_i^* u_i = R_{u,i} \qquad (20.5)$$

then the optimal weight vector $w^o$ will also vary with time. Specifically, the coefficient
vector for estimating $d(i)$ from $u_i$ in the linear least-mean-squares sense will be given by

$$w_i^o = R_{u,i}^{-1} R_{du,i} \qquad (20.6)$$

with minimum cost

$$J_{\min}(i) = \sigma_{d,i}^2 - R_{ud,i} R_{u,i}^{-1} R_{du,i} = E\, |d(i) - u_i w_i^o|^2 \qquad (20.7)$$

If the $\{R_{du,i}, R_{u,i}\}$ vary slowly with time, then it is justifiable to expect that an adaptive
filter will have sufficient time to track the optimal solution $w_i^o$. If, on the other hand, the
moments $\{R_{du,i}, R_{u,i}\}$ vary rapidly with time, then this task becomes challenging (and, at
times, impossible).

The purpose of a tracking analysis is to quantify how well an adaptive filter performs
under such changing conditions in the signal statistics. In order to make the analysis tractable,
it is customary to assume that the statistics of the data vary in a certain manner rather than
arbitrarily. For instance, the model that we shall adopt in the next section assumes that $R_u$
and $J_{\min}$ remain fixed, while only $\{\sigma_d^2, R_{du}\}$ may vary with time.


20.2 NONSTATIONARY DATA MODEL 


Thus, recall from the discussion in Sec. 15.2 that any given data $\{d(i), u_i\}$ can be assumed
to be related via a linear model of the form

$$d(i) = u_i w_i^o + v(i) \qquad (20.8)$$

where $w_i^o$ is the coefficient vector (20.6) that estimates $d(i)$ from $u_i$ optimally in the
linear least-mean-squares sense. Moreover, $v(i)$ is uncorrelated with $u_i$ and has variance
$\sigma_v^2(i) = J_{\min}(i)$. However, as was also the case in Sec. 15.2, we shall impose the stronger
assumption that


The sequence $\{v(i)\}$ is i.i.d. with constant variance $\sigma_v^2$ and is independent of all $\{u_j\}$ (20.9)


This condition on $v(i)$ is an assumption because the signal $v(i)$ in the model (20.8) is
only uncorrelated with $u_i$; it is not necessarily independent of $u_i$, or of all $\{u_j\}$ for that
matter. Moreover, the variance of $v(i)$ is not constant. Still there are situations where
conditions (20.8) and (20.9) hold simultaneously, e.g., in the channel estimation application of
Sec. 10.5. In that application, it is usually justified to expect the noise sequence $\{v(i)\}$ to
be i.i.d. and independent of all other data, including the regression data.


Random-Walk Model 


In addition to the model (20.8)-(20.9) for the data $\{d(i), u_i\}$, we shall also adopt a model
for the variations in the weight vector $w_i^o$. It is more convenient to adopt a model for the
variation in $w_i^o$ than a model for the variations in the statistics $\{\sigma_d^2, R_{du}\}$. One particular
model that is widely used in the adaptive filtering literature is a first-order random-walk
model. The model assumes that $w_i^o$ undergoes random variations of the form

$$w_i^o = w_{i-1}^o + q_i \qquad (20.10)$$

with $q_i$ denoting some random perturbation that is independent of $\{u_j, v(j)\}$ for all $i, j$.
Observe that we are now using boldface letters for $\{w_i^o, w_{i-1}^o\}$. This is because they
become random variables due to the presence of the random quantity $q_i$. The sequence
$\{q_i\}$ is assumed to be i.i.d., zero-mean, with covariance matrix

$$E\, q_i q_i^* = Q \qquad (20.11)$$

It is easy to see from (20.10) that

$$E\, w_i^o = E\, w_{i-1}^o \qquad (20.12)$$

so that the $\{w_i^o\}$ have a constant mean, which we shall denote by $w^o$,

$$E\, w_i^o \triangleq w^o$$

The initial condition for model (20.10) is modeled as a random variable $w_{-1}^o$, with mean
$w^o$ and independent of all other variables, $\{q_i, v(i), u_i\}$ for all $i$.


How Appropriate is this Model? 


Although widely adopted in the literature, model (20.10) is not necessarily meaningful.
For one thing, the covariance matrix of $w_i^o$ grows unbounded. To see this, observe from

$$w_i^o - w^o = w_{i-1}^o - w^o + q_i \qquad (20.13)$$

that

$$E\, (w_i^o - w^o)(w_i^o - w^o)^* = E\, (w_{i-1}^o - w^o)(w_{i-1}^o - w^o)^* + Q$$

This means that, at each time instant $i$, a nonnegative-definite matrix $Q$ is added to the
covariance matrix of $w_{i-1}^o$ in order to obtain the covariance matrix of $w_i^o$. As a result, the
covariance matrix of $w_i^o$ becomes unbounded as time progresses. A more adequate model
for tracking analysis would be to replace (20.10), or equivalently (20.13), by

$$(w_i^o - w^o) = \alpha (w_{i-1}^o - w^o) + q_i \qquad (20.14)$$

for some scalar $|\alpha| < 1$. In this case, the covariance matrix of $w_i^o$ would tend to a finite
steady-state value given by

$$\lim_{i \to \infty} E\, (w_i^o - w^o)(w_i^o - w^o)^* = Q/(1 - |\alpha|^2)$$
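
The contrast between the two models can be seen by iterating the covariance recursions in the scalar case. The sketch below is illustrative: the function names and the values of `Q`, `alpha`, and the number of steps are assumptions made here, and the initial covariance is taken as zero.

```python
def random_walk_variance(Q, steps, c0=0.0):
    """Scalar covariance recursion implied by (20.13): c_i = c_{i-1} + Q."""
    c = c0
    for _ in range(steps):
        c = c + Q
    return c

def ar1_variance(Q, alpha, steps, c0=0.0):
    """Scalar covariance recursion implied by (20.14): c_i = |alpha|^2 c_{i-1} + Q."""
    c = c0
    for _ in range(steps):
        c = abs(alpha) ** 2 * c + Q
    return c

Q, alpha = 1e-4, 0.99
rw = random_walk_variance(Q, 10000)      # grows linearly with the number of steps
ar = ar1_variance(Q, alpha, 10000)       # settles near Q / (1 - |alpha|^2)
limit = Q / (1 - abs(alpha) ** 2)
```

Under the random walk the variance keeps growing with the number of steps, whereas with $|\alpha| < 1$ it converges to the finite value $Q/(1 - |\alpha|^2)$.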


Still, it is customary in the literature to assume that the value of $\alpha$ is sufficiently close
to one, and to use model (20.10). The main reason for assuming $\alpha \approx 1$ is to simplify the
derivations during the tracking analysis. It is for this reason that we shall also proceed with
the simple (yet contrived) model (20.10) in the body of the chapter in order to illustrate the
key concepts. However, in the problems, and especially Probs. IV.29-IV.33, we extend the
analysis to models of the form (20.14), which are rewritten in Prob. IV.29 in the equivalent
form

$$\begin{cases} w_i^o = w^o + \theta_i \\ \theta_i = \alpha \theta_{i-1} + q_i \end{cases} \qquad (20.15)$$

Actually, Prob. IV.29 examines a more general model that also incorporates the effect of
what we refer to as cyclic nonstationarities. Such nonstationarities arise, for example, as
the result of carrier frequency offsets between transmitters and receivers in digital
communication systems. In the computer project at the end of this part, we provide an example
that shows how models of the form (20.14) arise in applications by studying the problem
of tracking a Rayleigh fading channel in a wireless communications environment.


Data Model 

To summarize, we shall adopt the following model in our study of the tracking performance
of adaptive filters in the body of the chapter. Specifically, we shall assume that the data
$\{d(i), u_i\}$ satisfy the following conditions:


(a) There exists a vector $w_i^o$ such that $d(i) = u_i w_i^o + v(i)$.

(b) The weight vector varies according to $w_i^o = w_{i-1}^o + q_i$.

(c) The noise sequence $\{v(i)\}$ is i.i.d. with constant variance $\sigma_v^2 = E\, |v(i)|^2$.

(d) The noise sequence $\{v(i)\}$ is independent of $u_j$ for all $i, j$.

(e) The sequence $q_i$ has covariance $Q$ and is independent of $\{v(j), u_j\}$ for all $i, j$.

(f) The initial conditions $\{w_{-1}, w_{-1}^o\}$ are independent of all $\{d(j), u_j, v(j), q_j\}$.

(g) The regressor covariance matrix is denoted by $R_u = E\, u_i^* u_i > 0$.

(h) The random variables $\{d(i), v(i), u_i, q_i\}$ are zero mean.

(i) The weight vector $w_i^o$ has constant mean $w^o$.
(20.16)


We refer to these conditions as describing a nonstationary environment. Observe that in
this model, the covariance matrix of the regression data is assumed to be constant and equal
to $R_u$; likewise for the (co-)variances of the noise components $\{v(i), q_i\}$. The constancy
of $\sigma_v^2$ means that, although $w_i^o$ is varying with time, the problem of estimating $d(i)$ from
$u_i$ optimally in the linear least-mean-squares sense has a constant minimum
cost for all $i$,

$$J_{\min} = \sigma_v^2 \qquad (20.17)$$

This is because, as explained in the beginning of Sec. 15.2, $v(i)$ plays the role of the
estimation error that results from estimating $d(i)$ from $u_i$.
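
Data that follow the nonstationary model (20.16) are straightforward to synthesize. The sketch below is a minimal real-valued generator under additional assumptions that are not part of the model: white unit-variance Gaussian regressors, $Q = \sigma_q^2 I$, Gaussian noise, and a deterministic starting vector. The function name and its parameters are hypothetical.

```python
import random

def generate_nonstationary_data(n, w_init, sigma_q, sigma_v, seed=0):
    """Generate real-valued {d(i), u_i, w_i^o} following (20.16), assuming
    white unit-variance regressors and Q = sigma_q^2 I."""
    rng = random.Random(seed)
    w_o = list(w_init)                                     # plays the role of w^o_{-1}
    U, D, W = [], [], []
    for _ in range(n):
        w_o = [wk + rng.gauss(0, sigma_q) for wk in w_o]   # (b) random-walk variation
        u = [rng.gauss(0, 1) for _ in w_o]                 # regressor u_i
        v = rng.gauss(0, sigma_v)                          # (c)-(d) i.i.d. noise, independent of u
        d = sum(uk * wk for uk, wk in zip(u, w_o)) + v     # (a) linear model
        U.append(u)
        D.append(d)
        W.append(w_o)
    return U, D, W

U, D, W = generate_nonstationary_data(5000, [0.5, -0.3], sigma_q=1e-3, sigma_v=0.03)

# The residual d(i) - u_i w_i^o recovers v(i), whose variance is J_min in (20.17).
residuals = [d - sum(uk * wk for uk, wk in zip(u, w)) for u, d, w in zip(U, D, W)]
noise_var = sum(r * r for r in residuals) / len(residuals)
```

The sample variance of the residuals stays close to $\sigma_v^2$ even though $w_i^o$ drifts, which illustrates the constant minimum cost noted above.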


Useful Independence Results 

A useful consequence of model (20.16) is that at any particular time instant $i$, the noise
variable $v(i)$ is independent of all previous weight estimators $\{w_j,\ j < i\}$. This fact
follows easily from examining the update equation of an adaptive filter. Consider, for
instance, the LMS recursion

\[ w_i = w_{i-1} + \mu u_i^* [d(i) - u_i w_{i-1}], \qquad w_{-1} = \text{initial condition} \]

By iterating this recursion we find that, for any time instant $j$, the weight estimator $w_j$ is
a function of $w_{-1}$, the reference signals $\{d(j), d(j-1), \ldots, d(0)\}$, and the regressors
$\{u_j, u_{j-1}, \ldots, u_0\}$. We can represent this dependency generically as

\[ w_j = F\,[\,w_{-1};\ d(j), \ldots, d(0);\ u_j, \ldots, u_0\,] \tag{20.18} \]

for some function $F$. A similar dependency holds for other adaptive schemes.

Now $v(i)$ is independent of each one of the terms appearing as an argument of $F$ in
(20.18), so that $v(i)$ is independent of $w_j$ for all $j < i$. The independence of $v(i)$ from
$\{w_{-1}, u_j, \ldots, u_0\}$ is obvious by assumption, while its independence from
$\{d(j), \ldots, d(0)\}$ can be seen as follows. Consider $d(j)$, for example. Then from

\[ d(j) = u_j w_j^o + v(j) = u_j \Big( w_{-1}^o + \sum_{k=0}^{j} q_k \Big) + v(j) \]

we see that $d(j)$ is a function of $\{u_j, v(j), w_{-1}^o, q_0, \ldots, q_j\}$, all of which are
independent of $v(i)$ for $j < i$. We therefore conclude that $v(i)$ is independent of $w_j$ for
all $j < i$.

It is also immediate to verify that $v(i)$ is independent of $\tilde{w}_j$ for all $j < i$, where
$\tilde{w}_j$ denotes the weight-error vector that is now defined in terms of the time-variant
weight vector $w_j^o$,

\[ \tilde{w}_j = w_j^o - w_j \]

It also follows that $v(i)$ is independent of the a priori estimation error $e_a(i)$, which in the
nonstationary case is defined as

\[ e_a(i) \triangleq u_i w_i^o - u_i w_{i-1} \]

This definition is consistent with our earlier definition in the stationary case in Sec. 15.2,
namely, $e_a(i) = u_i w^o - u_i w_{i-1}$. In both cases, $e_a(i)$ is a measure of the error in
estimating the uncorrupted part of $d(i)$. However, observe that in the nonstationary case, we
cannot write $e_a(i) = u_i \tilde{w}_{i-1}$ since $\tilde{w}_{i-1} = w_{i-1}^o - w_{i-1}$.

We summarize the above independence properties in the following statement. 


Lemma 20.1 (Useful properties) From the data model (20.16), it holds that
$v(i)$ is independent of each of the following:

\[ \{w_j \ \text{for}\ j < i\}, \qquad \{\tilde{w}_j = w_j^o - w_j \ \text{for}\ j < i\}, \qquad \text{and} \qquad e_a(i) = u_i (w_i^o - w_{i-1}) \]


Alternative Expression for the EMSE 

Using model (20.16), and the result of Lemma 20.1, we can express the filter EMSE in an
alternative form. Indeed, using $e(i) = d(i) - u_i w_{i-1}$, and part (a) of model (20.16), we
have $e(i) = v(i) + u_i (w_i^o - w_{i-1})$, i.e.,

\[ e(i) = v(i) + e_a(i) \tag{20.19} \]

Now since $v(i)$ and $e_a(i)$ are independent, and $v(i)$ has zero mean, it follows from (20.19)
that

\[ E\,|e(i)|^2 = E\,|v(i)|^2 + E\,|e_a(i)|^2 \]

The first term on the right-hand side is $\sigma_v^2$ which, as explained above in (20.17), coincides
with $J_{\min}$ and, hence,

\[ E\,|e(i)|^2 - J_{\min} = E\,|e_a(i)|^2 \]

Substituting this equality into definition (20.3) for the EMSE we arrive at the equivalent
characterization

\[ \text{EMSE} = \lim_{i \to \infty} E\,|e_a(i)|^2 \tag{20.20} \]

In other words, the EMSE of an adaptive filter operating in a nonstationary environment
can be computed by evaluating the steady-state variance of the estimation error $e_a(i)$.


Degree of Nonstationarity 
A lower bound on the EMSE can be determined as follows. Using $w_i^o = w_{i-1}^o + q_i$ we
have

\[
\begin{aligned}
e_a(i) &= u_i w_i^o - u_i w_{i-1} \\
       &= u_i (w_{i-1}^o + q_i) - u_i w_{i-1} \\
       &= u_i (w_{i-1}^o - w_{i-1}) + u_i q_i
\end{aligned}
\]

so that

\[
\begin{aligned}
E\,|e_a(i)|^2 &= E\,|u_i (w_{i-1}^o - w_{i-1}) + u_i q_i|^2 \\
              &= E\,|u_i (w_{i-1}^o - w_{i-1})|^2 + E\,|u_i q_i|^2 \\
              &\geq E\,|u_i q_i|^2 \\
              &= \mathrm{Tr}(R_u Q), \quad \text{for all } i
\end{aligned}
\]

where in the second equality we used the fact that $q_i$ is zero mean and independent of
$u_i (w_{i-1}^o - w_{i-1})$ and, hence, their cross-correlation is zero. In the last equality we
used the fact that $q_i$ and $u_i$ are independent. We therefore find that the misadjustment of
an adaptive filter in a nonstationary environment is lower bounded by

\[ \mathcal{M} \geq \mathrm{Tr}(R_u Q) / \sigma_v^2 \]

The ratio on the right-hand side, involving $\{R_u, Q, \sigma_v^2\}$, is equal to the square of what is
called the degree of nonstationarity (DN) of the data,

\[ \text{DN} \triangleq \sqrt{\mathrm{Tr}(R_u Q) / \sigma_v^2} \]

If the value of DN is larger than unity, then the statistical variations in the optimal weight
vector $w_i^o$ are too fast for the filter to be able to track them (and the misadjustment will
be large). On the other hand, if DN $\ll 1$, then the adaptive filter would generally be able to
track the variations in the weight vector. In this chapter, we are interested in evaluating the
tracking performance of adaptive filters in this latter situation, i.e., when tracking is possible.
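
As a small numerical illustration of the definition above, the following sketch evaluates DN and the associated lower bound on the misadjustment. The function names and the example values of $R_u$, $Q$, and $\sigma_v^2$ are hypothetical and chosen only for illustration.

```python
import numpy as np

def degree_of_nonstationarity(R_u, Q, sigma_v2):
    """Return DN = sqrt(Tr(R_u Q) / sigma_v^2)."""
    R_u = np.asarray(R_u, dtype=float)
    Q = np.asarray(Q, dtype=float)
    return float(np.sqrt(np.trace(R_u @ Q) / sigma_v2))

def misadjustment_lower_bound(R_u, Q, sigma_v2):
    """Return Tr(R_u Q) / sigma_v^2, the lower bound on the misadjustment (DN squared)."""
    return degree_of_nonstationarity(R_u, Q, sigma_v2) ** 2

# Hypothetical example: white regressors and a slow random walk.
M = 4
R_u = np.eye(M)              # assumed regressor covariance
Q = 1e-4 * np.eye(M)         # assumed covariance of the random-walk increments q_i
sigma_v2 = 1e-2              # assumed noise variance

dn = degree_of_nonstationarity(R_u, Q, sigma_v2)
bound = misadjustment_lower_bound(R_u, Q, sigma_v2)
print(f"DN = {dn:.3f}, misadjustment >= {bound:.3f}")
```

With these assumed values DN is well below one, which is the regime studied in the rest of the chapter.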


Error Quantities 

For ease of reference, we collect in Table 20.1 the definitions of several of the error
measures introduced so far. Observe that, in contrast to the stationary case shown in Table 15.1,
the definition of $\tilde{w}_i$ is now with respect to $w_i^o$, and the definitions of the a priori and
a posteriori estimation errors $\{e_a(i), e_p(i)\}$ are in terms of $w_i^o$. Comparing with the
definitions in the stationary case, observe that $e_a(i)$ and $e_p(i)$ in Table 20.1 are not
expressed as $u_i \tilde{w}_{i-1}$ and $u_i \tilde{w}_i$, respectively. While this form of expression is
valid for $e_p(i)$, it is not valid for $e_a(i)$ in the nonstationary case. The proper interpretation
for $\{e_a(i), e_p(i)\}$, for both cases of stationary and nonstationary models, is to regard them as
the errors in estimating the uncorrupted part of $d(i)$ by using $\{u_i w_{i-1}, u_i w_i\}$.


TABLE 20.1 Definitions of several estimation errors.

Error            Definition                       Interpretation
$e(i)$           $d(i) - u_i w_{i-1}$             a priori output estimation error
$r(i)$           $d(i) - u_i w_i$                 a posteriori output estimation error
$\tilde{w}_i$    $w_i^o - w_i$                    weight estimation error
$e_a(i)$         $u_i w_i^o - u_i w_{i-1}$        a priori estimation error
$e_p(i)$         $u_i w_i^o - u_i w_i$            a posteriori estimation error


As was done earlier in Chapter 15, before moving on to derive expressions for the
EMSE of several adaptive algorithms under model (20.16), we first establish a fundamental
energy-conservation result that holds for general data $\{d(i), u_i\}$ (i.e., it does not require
the assumptions (20.16)). The derivation that follows is distinct from the arguments we
used earlier in Sec. 15.3 only in that it uses the new definition of $\tilde{w}_i$, which is relative to
the time-variant coefficient $w_i^o$. Therefore, we shall be brief.

20.3 ENERGY CONSERVATION RELATION 


We again consider adaptive filters whose update equations are of the form

\[ w_i = w_{i-1} + \mu u_i^* g[e(i)], \qquad w_{-1} = \text{initial condition} \tag{20.21} \]

where $g[\cdot]$ denotes the error function; several examples were listed in Table 15.2.
The update recursion (20.21) can be written in terms of the weight-error vector
$\tilde{w}_i = w_i^o - w_i$. Subtracting both sides of (20.21) from $w_i^o$ gives

\[ w_i^o - w_i = (w_i^o - w_{i-1}) - \mu u_i^* g[e(i)] \tag{20.22} \]

Multiplying both sides of this equation by $u_i$ from the left we find that the a priori and a
posteriori errors $\{e_a(i), e_p(i)\}$ are related via

\[ e_p(i) = e_a(i) - \mu \|u_i\|^2 g[e(i)] \tag{20.23} \]

Equations (20.22) and (20.23) have the same form as equations (15.24) and (15.25) that
we derived earlier in Sec. 15.3. Therefore, following the exact same arguments that we
presented in that section, we arrive at the following extension of Thm. 15.1.


Theorem 20.1 (Energy-conservation relation) For any adaptive filter of the
form (20.21), and for any data $\{d(i), u_i\}$, it holds that

\[ \|w_i^o - w_i\|^2 + \bar{\mu}(i)\,|e_a(i)|^2 = \|w_i^o - w_{i-1}\|^2 + \bar{\mu}(i)\,|e_p(i)|^2 \tag{20.24} \]

where $e_a(i) = u_i (w_i^o - w_{i-1})$, $e_p(i) = u_i (w_i^o - w_i)$, and $\bar{\mu}(i)$ is defined as

\[ \bar{\mu}(i) \triangleq (\|u_i\|^2)^{\dagger} = \begin{cases} 1/\|u_i\|^2 & \text{if } u_i \neq 0 \\ 0 & \text{otherwise} \end{cases} \]


Comparing (20.24) with the energy-conservation relation (15.32) in the stationary case,
we see that the only difference pertains to the interpretation of the terms $w_i^o - w_i$ and
$w_i^o - w_{i-1}$ that appear on both sides of (20.24). While the first difference can be recognized
as $\tilde{w}_i$, just like the term on the left-hand side of (15.32), the second difference is not
$\tilde{w}_{i-1}$ since, in the nonstationary case, $\tilde{w}_{i-1}$ is defined as
$\tilde{w}_{i-1} = w_{i-1}^o - w_{i-1}$ (in terms of $w_{i-1}^o$ and not $w_i^o$).
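
Since relation (20.24) holds for any data and any error function, it can be checked numerically on arbitrary inputs. The sketch below does so for real-valued data and three error functions. The dimensions, step-size, random draws, and the helper name `energy_gap` are assumptions made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
M, mu = 5, 0.05   # assumed filter length and step-size

def energy_gap(g):
    """Relative mismatch between the two sides of (20.24) for one random update."""
    u = rng.standard_normal(M)         # regressor u_i (real-valued row vector)
    w_o = rng.standard_normal(M)       # current optimal weight w_i^o
    w_prev = rng.standard_normal(M)    # previous estimate w_{i-1}
    v = 0.1 * rng.standard_normal()    # measurement noise v(i)

    e_a = u @ (w_o - w_prev)           # a priori estimation error
    e = e_a + v                        # output error e(i) = e_a(i) + v(i)
    w = w_prev + mu * u * g(e)         # update of the form (20.21), real data
    e_p = u @ (w_o - w)                # a posteriori estimation error
    mu_bar = 1.0 / (u @ u)             # pseudo-inverse of ||u_i||^2 (u is nonzero here)

    lhs = np.sum((w_o - w) ** 2) + mu_bar * e_a ** 2
    rhs = np.sum((w_o - w_prev) ** 2) + mu_bar * e_p ** 2
    return abs(lhs - rhs) / max(abs(lhs), 1.0)

# Error functions of LMS, sign-error LMS, and LMF.
error_functions = (lambda e: e, np.sign, lambda e: e ** 3)
gaps = [energy_gap(g) for g in error_functions]
print(gaps)
```

The mismatch is at the level of floating-point round-off in all three cases, which reflects the fact that (20.24) is an exact identity and not an approximation.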


20.4 VARIANCE RELATION 


In order to explain the relevance of the energy relation (20.24) to the tracking analysis of
adaptive filters, we refer to the data model (20.16) and, in particular, to condition (b). The
condition states that $w_i^o$ varies according to the random-walk model

\[ w_i^o = w_{i-1}^o + q_i \tag{20.25} \]

where $q_i$ is an i.i.d. sequence with covariance matrix $Q$ and is independent of the initial
conditions $\{w_{-1}^o, w_{-1}\}$, of $\{u_j\}$ for all $j$, and of $\{d(j)\}$ for all $j < i$. This
random-walk model, and the assumptions on $q_i$, are the only conditions that we require from the
data model (20.16) for the derivation in this section. [The conditions on $v(i)$, such as its
independence of all $u_j$ and $q_j$, are not needed here.]

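Taking expectations of both sides of the energy-conservation relation (20.24) we get

\[ E\,\|\tilde{w}_i\|^2 + E\,\bar{\mu}(i)|e_a(i)|^2 = E\,\|w_i^o - w_{i-1}\|^2 + E\,\bar{\mu}(i)|e_p(i)|^2 \tag{20.26} \]

since $\tilde{w}_i = w_i^o - w_i$. Moreover, the model (20.25) allows us to relate
$E\,\|w_i^o - w_{i-1}\|^2$ to $E\,\|\tilde{w}_{i-1}\|^2$ as follows:

\[
\begin{aligned}
E\,\|w_i^o - w_{i-1}\|^2 &= E\,\|w_{i-1}^o + q_i - w_{i-1}\|^2 \\
  &= E\,\|\tilde{w}_{i-1} + q_i\|^2 \\
  &= E\,(\tilde{w}_{i-1} + q_i)^* (\tilde{w}_{i-1} + q_i) \\
  &= E\,\|\tilde{w}_{i-1}\|^2 + E\,\|q_i\|^2 + E\,\tilde{w}_{i-1}^* q_i + E\,q_i^* \tilde{w}_{i-1}
\end{aligned}
\tag{20.27}
\]

We can be more explicit about the cross terms $E\,\tilde{w}_{i-1}^* q_i$ and $E\,q_i^* \tilde{w}_{i-1}$.
Indeed, note that

\[ \tilde{w}_{i-1} = w_{i-1}^o - w_{i-1} = \Big( w_{-1}^o + \sum_{j=0}^{i-1} q_j \Big) - w_{i-1} \]

so that

\[
\begin{aligned}
E\,\tilde{w}_{i-1}^* q_i &= E \Big( w_{-1}^o + \sum_{j=0}^{i-1} q_j \Big)^{*} q_i - E\,w_{i-1}^* q_i \\
  &= -E\,w_{i-1}^* q_i
\end{aligned}
\]

where we used the fact that $q_i$ is zero mean and independent of all previous $q_j$ and of
$w_{-1}^o$. But since $w_{i-1}$ is a function of the variables

\[ \{\, u_{i-1}, \ldots, u_0,\ d(i-1), \ldots, d(0) \,\} \]

all of which are independent of $q_i$, we conclude that

\[ E\,\tilde{w}_{i-1}^* q_i = 0 \]

Likewise, $E\,q_i^* \tilde{w}_{i-1} = 0$. Substituting into (20.27) we find that

\[ E\,\|w_i^o - w_{i-1}\|^2 = E\,\|\tilde{w}_{i-1}\|^2 + \mathrm{Tr}(Q) \tag{20.28} \]

and (20.26) becomes

\[ E\,\|\tilde{w}_i\|^2 + E\,\bar{\mu}(i)|e_a(i)|^2 = E\,\|\tilde{w}_{i-1}\|^2 + E\,\bar{\mu}(i)|e_p(i)|^2 + \mathrm{Tr}(Q) \tag{20.29} \]

Comparing with relation (15.36) in the stationary case, we see that the only difference is
the appearance of the additional term $\mathrm{Tr}(Q)$. All other terms are identical.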


Steady-State Performance 
Now assume that an adaptive filter is operating in steady-state (cf. Def. 15.1), i.e.,

\[ E\,\|\tilde{w}_i\|^2 = E\,\|\tilde{w}_{i-1}\|^2, \quad \text{as } i \to \infty \tag{20.30} \]

It then follows from (20.29) that

\[ E\,\bar{\mu}(i)|e_a(i)|^2 = \mathrm{Tr}(Q) + E\,\bar{\mu}(i)|e_p(i)|^2, \quad \text{as } i \to \infty \]

Using (20.23), we can express $e_p(i)$ in terms of $e_a(i)$ and get

\[ E\,\bar{\mu}(i)|e_a(i)|^2 = \mathrm{Tr}(Q) + E\,\bar{\mu}(i)\,\big| e_a(i) - \mu \|u_i\|^2 g[e(i)] \big|^2, \quad \text{as } i \to \infty \tag{20.31} \]

This relation can be simplified by expanding both of its sides, and by following the same
steps that we carried out for (15.38), thus leading to the following conclusion.



Theorem 20.2 (Variance relation) Consider any adaptive filter of the form
(20.21), and assume filter operation in steady-state. Assume further that

\[ d(i) = u_i w_i^o + v(i) \]

where $w_i^o$ varies according to the random-walk model $w_i^o = w_{i-1}^o + q_i$ and
$q_i$ is a zero-mean i.i.d. sequence with covariance matrix $Q$. Moreover, $q_i$ is
independent of $\{d(j),\ j < i\}$ and of $\{u_j, w_{-1}^o\}$ for all $j$. Then it holds that

\[ \mu\,E\,\|u_i\|^2 |g[e(i)]|^2 + \mu^{-1}\mathrm{Tr}(Q) = 2\,\mathrm{Re}\big( E\,e_a^*(i)\,g[e(i)] \big), \quad \text{as } i \to \infty \tag{20.32} \]

where $e(i) = e_a(i) + v(i)$. For real-valued data, the above relation becomes

\[ \mu\,E\,\|u_i\|^2 g^2[e(i)] + \mu^{-1}\mathrm{Tr}(Q) = 2\,E\,e_a(i)\,g[e(i)], \quad \text{as } i \to \infty \tag{20.33} \]


Alternatively, the variance relation (20.32) can be obtained by starting directly from the
weight-error vector recursion (20.22) and equating the squared Euclidean norms of both
sides, i.e.,

\[
\begin{aligned}
\|w_i^o - w_i\|^2 &= \|(w_i^o - w_{i-1}) - \mu u_i^* g[e(i)]\|^2 \\
  &= \|w_i^o - w_{i-1}\|^2 + \mu^2 \|u_i\|^2 |g[e(i)]|^2 - 2\mu\,\mathrm{Re}\big( e_a^*(i)\,g[e(i)] \big)
\end{aligned}
\]

Taking expectations of both sides as $i \to \infty$ and using (20.28) gives (20.32). Comparing
expression (20.32) with the variance relation of Thm. 15.2 in the stationary case, we see
that the only difference is the appearance of the additional term $\mu^{-1}\mathrm{Tr}(Q)$ on the
left-hand side of (20.32). All other terms are identical to those in (15.40). This observation
shows that obtaining the EMSE of an adaptive filter in a nonstationary environment is a
straightforward extension of the calculations carried out in Chapter 15 for obtaining the EMSE
of the filter in a stationary environment.


Relevance to Tracking Analysis 

Relation (20.32) is an identity that involves $e_a(i)$ and, in principle, it could be used to
evaluate the EMSE of an adaptive filter in a nonstationary environment. We say "in principle"
because, although the result (20.32) is exact, different choices for the error function $g[\cdot]$
lead to different equations in $E\,|e_a(\infty)|^2$, some of which are easier to solve than others.
It is at this stage that simplifying assumptions become necessary. We shall illustrate this
procedure for several adaptive filters in the sections below. As in Chapter 15, we shall
employ the symbol $\zeta$ to refer to the EMSE of an adaptive filter.

The reader will soon realize that the arguments that are used in the sequel in order to
arrive at the nonstationary EMSE, starting from the variance relation (20.32), are almost
identical to the arguments used before in Chapters 16-19 while studying the stationary
EMSE of several adaptive filters. For this reason, the derivations here are brief.


21.1 PERFORMANCE OF LMS 


We start with the simplest of algorithms, namely, LMS. Thus, assume that $\{d(i), u_i\}$
satisfy model (20.16) and consider the LMS recursion

\[ w_i = w_{i-1} + \mu u_i^* e(i) \tag{21.1} \]

for which

\[ g[e(i)] = e(i) = e_a(i) + v(i) \tag{21.2} \]

Relation (20.32) then becomes

\[ \mu\,E\,\|u_i\|^2 |e_a(i) + v(i)|^2 + \mu^{-1}\mathrm{Tr}(Q) = 2\,\mathrm{Re}\big( E\,e_a^*(i)[e_a(i) + v(i)] \big), \quad \text{as } i \to \infty \tag{21.3} \]

Except for the term $\mu^{-1}\mathrm{Tr}(Q)$ on the left-hand side, we note that the identity (21.3)
has the same form as the identity (16.3) that appeared in our study of the mean-square
performance of LMS in Chapter 16. Therefore, performing the same expansions that we did in
that section following (16.3), we can readily verify that (21.3) leads to

\[ \zeta^{\mathrm{LMS}} = \frac{1}{2}\Big[ \mu\,E\,\|u_i\|^2 |e_a(i)|^2 + \mu \sigma_v^2 \mathrm{Tr}(R_u) + \mu^{-1}\mathrm{Tr}(Q) \Big], \quad \text{as } i \to \infty \tag{21.4} \]

This expression extends the result (16.5) to the nonstationary case. In order to evaluate
$\zeta^{\mathrm{LMS}} = E\,|e_a(\infty)|^2$, we again examine three cases (the similarities with the
arguments in Chapter 16 are obvious).


Small Step-Sizes 
Assume first that the step-size $\mu$ is such that, in steady-state, the term

\[ \mu\,E\,\|u_i\|^2 |e_a(i)|^2 \]

can be neglected when compared to the term

\[ \mu \sigma_v^2 \mathrm{Tr}(R_u) + \mu^{-1}\mathrm{Tr}(Q) \]

This condition occurs for data with a sufficiently small degree of nonstationarity and for
step-sizes $\mu$ in the vicinity of $\mu_{\mathrm{opt}}^{\mathrm{LMS}}$ in (21.6). Then, expression (21.4) gives

\[ \zeta^{\mathrm{LMS}} = \frac{\mu \sigma_v^2 \mathrm{Tr}(R_u) + \mu^{-1}\mathrm{Tr}(Q)}{2} \tag{21.5} \]

This result highlights the effect of the step-size on the performance of LMS. The term
$\mu \sigma_v^2 \mathrm{Tr}(R_u)$ is the same one we encountered earlier in expression (16.6) while studying
the EMSE of LMS in stationary environments. The additional term $\mu^{-1}\mathrm{Tr}(Q)$ reflects the
effect of the nonstationarity in the weight vector $w_i^o$ on filter performance. Observe in
particular that $\mathrm{Tr}(Q)$ appears multiplied by $\mu^{-1}$ so that the larger the step-size, the
smaller the effect of the nonstationarity on the EMSE. This behavior is intuitive since a larger
step-size (usually) signifies faster adaptation, in which case LMS will have a better chance
at "learning" and at "following" the data statistics. A small step-size, on the other hand,
leads to smaller EMSE under stationary conditions, but it may also lead to poor tracking
performance.

This discussion suggests that there exists a compromise choice for the step-size, which
is obtained by minimizing (21.5) with respect to $\mu$. Setting the derivative of $\zeta^{\mathrm{LMS}}$
equal to zero gives

\[ \mu_{\mathrm{opt}}^{\mathrm{LMS}} = \sqrt{\mathrm{Tr}(Q) / \big(\sigma_v^2 \mathrm{Tr}(R_u)\big)} \tag{21.6} \]

Substituting the above optimal value for $\mu$ into (21.5) we find that the resulting minimum
EMSE is given by

\[ \zeta_{\min}^{\mathrm{LMS}} = \sqrt{\sigma_v^2\,\mathrm{Tr}(R_u)\,\mathrm{Tr}(Q)} \tag{21.7} \]
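
The trade-off expressed by (21.5)-(21.7) can be illustrated with a few lines of code. The sketch below evaluates the small-step-size EMSE and the compromise step-size; the function names and the numerical values of $\sigma_v^2$, $\mathrm{Tr}(R_u)$, and $\mathrm{Tr}(Q)$ are hypothetical.

```python
import math

def lms_tracking_emse_small_step(mu, sigma_v2, tr_Ru, tr_Q):
    """Small-step-size tracking EMSE of LMS, expression (21.5)."""
    return 0.5 * (mu * sigma_v2 * tr_Ru + tr_Q / mu)

def lms_optimal_step_small(sigma_v2, tr_Ru, tr_Q):
    """Optimal step-size (21.6) and minimum EMSE (21.7)."""
    mu_opt = math.sqrt(tr_Q / (sigma_v2 * tr_Ru))
    emse_min = math.sqrt(sigma_v2 * tr_Ru * tr_Q)
    return mu_opt, emse_min

# Assumed example values, for illustration only.
sigma_v2, tr_Ru, tr_Q = 1e-2, 5.0, 5e-6

mu_opt, emse_min = lms_optimal_step_small(sigma_v2, tr_Ru, tr_Q)
for mu in (0.25 * mu_opt, mu_opt, 4.0 * mu_opt):
    emse = lms_tracking_emse_small_step(mu, sigma_v2, tr_Ru, tr_Q)
    print(f"mu = {mu:.4f}  EMSE = {emse:.3e}")
```

Step-sizes on either side of $\mu_{\mathrm{opt}}^{\mathrm{LMS}}$ give a larger EMSE: a smaller step-size is dominated by the lag term $\mu^{-1}\mathrm{Tr}(Q)$, and a larger one by the gradient-noise term $\mu\sigma_v^2\mathrm{Tr}(R_u)$.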


Separation Principle 
Rather than neglect the effect of the term $\mu\,E\,\|u_i\|^2 |e_a(i)|^2$ in steady-state, we can call
upon the separation assumption (16.7) that we introduced earlier in Sec. 16.3, namely, that

\[ \text{At steady-state, } \|u_i\|^2 \text{ is independent of } e_a(i) \tag{21.8} \]

Using this assumption we have

\[
\begin{aligned}
E\,\big( \|u_i\|^2 \cdot |e_a(i)|^2 \big) &= \big( E\,\|u_i\|^2 \big) \cdot \big( E\,|e_a(i)|^2 \big) \\
  &= \mathrm{Tr}(R_u)\,E\,|e_a(i)|^2
\end{aligned}
\]

so that substituting into (21.4) we obtain

\[ \zeta^{\mathrm{LMS}} = \frac{\mu \sigma_v^2 \mathrm{Tr}(R_u) + \mu^{-1}\mathrm{Tr}(Q)}{2 - \mu\,\mathrm{Tr}(R_u)} \tag{21.9} \]

Once more, this expression differs from expression (16.10) in the stationary case only by
the additional term $\mu^{-1}\mathrm{Tr}(Q)$ that appears in the numerator. Observe further that if $\mu$ is
such that $2 - \mu\,\mathrm{Tr}(R_u) \approx 2$, then (21.9) reduces to (21.5).

Differentiating (21.9) with respect to $\mu$ leads to the following expression for the optimal
step-size (see Prob. IV.22):

\[ \mu_{\mathrm{opt}}^{\mathrm{LMS}} = \sqrt{ \frac{\mathrm{Tr}(Q)}{\sigma_v^2\,\mathrm{Tr}(R_u)} + \frac{[\mathrm{Tr}(Q)]^2}{4\sigma_v^4} } \; - \; \frac{\mathrm{Tr}(Q)}{2\sigma_v^2} \tag{21.10} \]

Substituting into (21.9) we can find the minimum value for the EMSE. We forgo this
calculation here.
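
As a sanity check on (21.10), the closed-form step-size can be compared against a brute-force minimization of (21.9) over a fine grid. This is only a sketch: the function names, the grid, and the parameter values are assumptions, and the grid is restricted to $\mu < 2/\mathrm{Tr}(R_u)$ so that the denominator of (21.9) stays positive.

```python
import numpy as np

def lms_tracking_emse_separation(mu, sigma_v2, tr_Ru, tr_Q):
    """Tracking EMSE of LMS under the separation assumption, expression (21.9)."""
    return (mu * sigma_v2 * tr_Ru + tr_Q / mu) / (2.0 - mu * tr_Ru)

def lms_optimal_step_separation(sigma_v2, tr_Ru, tr_Q):
    """Optimal step-size from expression (21.10)."""
    r = tr_Q / sigma_v2
    return np.sqrt(r / tr_Ru + r ** 2 / 4.0) - r / 2.0

# Assumed example values, for illustration only.
sigma_v2, tr_Ru, tr_Q = 1e-2, 5.0, 5e-4

mu_closed = lms_optimal_step_separation(sigma_v2, tr_Ru, tr_Q)

# Brute-force search over 0 < mu < 2 / Tr(R_u) = 0.4.
grid = np.linspace(1e-4, 0.39, 400001)
mu_grid = grid[np.argmin(lms_tracking_emse_separation(grid, sigma_v2, tr_Ru, tr_Q))]

print(f"closed form: {mu_closed:.6f}, grid search: {mu_grid:.6f}")
```

The two values agree to within the grid resolution. Note that (21.10) is the positive root of $\sigma_v^2\mathrm{Tr}(R_u)\,\mu^2 + \mathrm{Tr}(Q)\mathrm{Tr}(R_u)\,\mu - \mathrm{Tr}(Q) = 0$, which is what setting the derivative of (21.9) to zero produces.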


White Gaussian Input 


As discussed in Sec. 16.4, one particular case for which the term $E\,\|u_i\|^2 |e_a(i)|^2$ that
appears in (21.4) can be evaluated in closed-form occurs when the regressor $u_i$ has a
circular Gaussian distribution with a diagonal covariance matrix of the form

\[ R_u = \sigma_u^2 I, \qquad \sigma_u^2 > 0 \tag{21.11} \]

The diagonal structure of $R_u$ means that the entries of $u_i$ are uncorrelated among
themselves and that each has variance $\sigma_u^2$. In addition to (21.11), we also assume that (see
the justification in Sec. 16.4):

\[ \text{At steady state, } \tilde{w}_{i-1} \text{ is independent of } u_i \tag{21.12} \]

Under conditions (21.11)-(21.12), we showed in Sec. 16.4 that

\[ E\,\|u_i\|^2 |e_a(i)|^2 = (M + \gamma)\,\sigma_u^2\,E\,|e_a(i)|^2 \]

where $\gamma = 2$ for real-valued data and $\gamma = 1$ for complex-valued data; see (16.17) and
(16.19). Substituting into (21.4) we obtain

\[ \zeta^{\mathrm{LMS}} = \frac{1}{2}\Big[ \mu (M + \gamma) \sigma_u^2\,\zeta^{\mathrm{LMS}} + \mu M \sigma_v^2 \sigma_u^2 + \mu^{-1}\mathrm{Tr}(Q) \Big] \]

so that

\[ \zeta^{\mathrm{LMS}} = \frac{\mu M \sigma_v^2 \sigma_u^2 + \mu^{-1}\mathrm{Tr}(Q)}{2 - \mu (M + \gamma) \sigma_u^2} \tag{21.13} \]

Here again, the only difference from (16.18) is the presence of the additional term
$\mu^{-1}\mathrm{Tr}(Q)$ in the numerator. Differentiating (21.13) with respect to $\mu$ gives

\[ \mu_{\mathrm{opt}}^{\mathrm{LMS}} = \sqrt{ \frac{\mathrm{Tr}(Q)}{M \sigma_u^2 \sigma_v^2} + \frac{[\mathrm{Tr}(Q)(M + \gamma)]^2}{4 M^2 \sigma_v^4} } \; - \; \frac{(M + \gamma)\,\mathrm{Tr}(Q)}{2 M \sigma_v^2} \tag{21.14} \]

We summarize the results for LMS in Lemma 21.1.


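Lemma 21.1 (Tracking EMSE of LMS) Consider the LMS algorithm (21.1) and
assume $\{d(i), u_i\}$ satisfy the nonstationary model (20.16) with a sufficiently
small degree of nonstationarity. Then its EMSE can be approximated by the
following expressions:

1. For small step-sizes, it holds that

\[ \zeta^{\mathrm{LMS}} = \frac{\mu \sigma_v^2 \mathrm{Tr}(R_u) + \mu^{-1}\mathrm{Tr}(Q)}{2}, \qquad \mu_{\mathrm{opt}}^{\mathrm{LMS}} = \sqrt{\frac{\mathrm{Tr}(Q)}{\sigma_v^2\,\mathrm{Tr}(R_u)}} \quad \text{with} \quad \zeta_{\min}^{\mathrm{LMS}} = \sqrt{\sigma_v^2\,\mathrm{Tr}(R_u)\,\mathrm{Tr}(Q)} \]

2. Under the separation assumption (21.8), it holds that

\[ \zeta^{\mathrm{LMS}} = \frac{\mu \sigma_v^2 \mathrm{Tr}(R_u) + \mu^{-1}\mathrm{Tr}(Q)}{2 - \mu\,\mathrm{Tr}(R_u)}, \qquad \mu_{\mathrm{opt}}^{\mathrm{LMS}} = \sqrt{ \frac{\mathrm{Tr}(Q)}{\sigma_v^2\,\mathrm{Tr}(R_u)} + \frac{[\mathrm{Tr}(Q)]^2}{4\sigma_v^4} } - \frac{\mathrm{Tr}(Q)}{2\sigma_v^2} \]

3. If $u_i$ is Gaussian with $R_u = \sigma_u^2 I$, and under assumption (21.12), it holds
that $\zeta^{\mathrm{LMS}}$ and $\mu_{\mathrm{opt}}^{\mathrm{LMS}}$ are given by (21.13) and (21.14), respectively,
where $\gamma = 2$ if the data are real-valued and $\gamma = 1$ if the data are
complex-valued with $u_i$ circular. Here $M$ is the dimension of $u_i$.

In all cases, the misadjustment is obtained by dividing the EMSE by $\sigma_v^2$. Also,
substituting the expressions for $\mu_{\mathrm{opt}}^{\mathrm{LMS}}$ into the expressions for the EMSE we find
the corresponding optimal EMSE.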
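
The expressions in the lemma can be compared against a direct simulation of LMS tracking a random-walk weight vector. The following is a minimal sketch for real-valued white Gaussian regressors, so that (21.13) with $\gamma = 2$ applies. All numerical values, the run length, the seed, and the function name `simulate_lms_tracking` are assumptions chosen for illustration, and the time average over a single run is used as a stand-in for the steady-state expectation.

```python
import numpy as np

def simulate_lms_tracking(M=5, mu=0.01, sigma_v2=1e-2, q_var=1e-6,
                          n_iter=200_000, n_skip=20_000, seed=0):
    """Time-averaged |e_a(i)|^2 of LMS under the random-walk model (20.16).

    Assumes real data with R_u = I and Q = q_var * I.
    """
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((n_iter, M))                        # regressors u_i
    V = np.sqrt(sigma_v2) * rng.standard_normal(n_iter)         # noise v(i)
    Qn = np.sqrt(q_var) * rng.standard_normal((n_iter, M))      # increments q_i

    w_o = rng.standard_normal(M)    # initial optimal weight w_{-1}^o
    w = np.zeros(M)                 # initial estimate w_{-1}
    acc = 0.0
    for i in range(n_iter):
        w_o = w_o + Qn[i]           # random walk: w_i^o = w_{i-1}^o + q_i
        u = U[i]
        e_a = u @ (w_o - w)         # a priori error u_i (w_i^o - w_{i-1})
        e = e_a + V[i]              # output error e(i) = d(i) - u_i w_{i-1}
        w = w + mu * u * e          # LMS update (21.1), real data
        if i >= n_skip:             # discard the initial transient
            acc += e_a ** 2
    return acc / (n_iter - n_skip)

M, mu, sigma_v2, q_var = 5, 0.01, 1e-2, 1e-6
emse_sim = simulate_lms_tracking(M=M, mu=mu, sigma_v2=sigma_v2, q_var=q_var)

# Expression (21.13) with sigma_u^2 = 1, gamma = 2, and Tr(Q) = M * q_var.
emse_theory = (mu * M * sigma_v2 + M * q_var / mu) / (2.0 - mu * (M + 2))
print(f"simulated EMSE = {emse_sim:.3e}, theory (21.13) = {emse_theory:.3e}")
```

For a small step-size and a small degree of nonstationarity, as assumed here, the simulated value should fall close to the theoretical one; larger step-sizes or faster random walks would be expected to widen the gap.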


Remark 21.1 (Auto-regressive model) The results derived so far assume that the weight vector
$w_i^o$ varies according to the contrived model (20.10). However, we extend the results in the problems
at the end of this part to a more realistic nonstationary model. For example, a specialization of the
result of Prob. IV.33 (obtained by setting the variable in that problem to zero) shows the following.
If $w_i^o$ varies according to (20.15) (or, equivalently, (20.14)), for some $|\alpha| < 1$, then the EMSE of
LMS is approximated by

\[ \zeta^{\mathrm{LMS}} = \frac{\mu \sigma_v^2 \mathrm{Tr}(R_u) + \mu^{-1}\beta}{2} \]

where the scalar $\beta$ is defined by
2 x 
B = Ip’ {Tr [(1 - Re(a))I + (1-a*) Xa — u(1- o) R]Q) 


Xa = (I— Rua) [o*1- (I-4R,)] ^ 


Other approximations for $\zeta^{\mathrm{LMS}}$ also appear in the problem. It can be seen from the expression for
$\beta$ that $\beta \to \mathrm{Tr}(Q)$ as $\alpha \to 1$, in which case we recover the first result stated in Lemma 21.1.


21.2 PERFORMANCE OF NLMS


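We now examine the tracking performance of $\epsilon$-NLMS,

\[ w_i = w_{i-1} + \frac{\mu}{\epsilon + \|u_i\|^2}\,u_i^* e(i) \tag{21.15} \]

for which

\[ g[e(i)] = \frac{e(i)}{\epsilon + \|u_i\|^2} = \frac{e_a(i) + v(i)}{\epsilon + \|u_i\|^2} \tag{21.16} \]

and the data $\{d(i), u_i, v(i)\}$ are assumed to satisfy model (20.16). The variance relation
(20.32) in this case becomes (with $i \to \infty$):

\[ \mu\,E\left[ \frac{\|u_i\|^2 |e(i)|^2}{(\epsilon + \|u_i\|^2)^2} \right] + \mu^{-1}\mathrm{Tr}(Q) = 2\,\mathrm{Re}\left\{ E\left( \frac{e_a^*(i)\,e(i)}{\epsilon + \|u_i\|^2} \right) \right\} \tag{21.17} \]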

Separation Principle 

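Except for the term $\mu^{-1}\mathrm{Tr}(Q)$, the above equality has the same form as the identity (17.3),
which appeared in our study of the mean-square performance of $\epsilon$-NLMS in the stationary
case in Sec. 17.1. Therefore, by performing the same expansions that we did in that section
following (17.3), we can find that the above equality reduces to

\[ (2\eta_u - \mu \alpha_u)\,E\,|e_a(i)|^2 = \mu \sigma_v^2 \alpha_u + \mu^{-1}\mathrm{Tr}(Q), \quad i \to \infty \]

so that

\[ \zeta^{\epsilon\text{-NLMS}} = \frac{\mu \sigma_v^2 \alpha_u + \mu^{-1}\mathrm{Tr}(Q)}{2\eta_u - \mu \alpha_u} \tag{21.18} \]

where

\[ \alpha_u \triangleq E\left( \frac{\|u_i\|^2}{(\epsilon + \|u_i\|^2)^2} \right), \qquad \eta_u \triangleq E\left( \frac{1}{\epsilon + \|u_i\|^2} \right) \tag{21.19} \]

When the regularization parameter $\epsilon$ is small, as is usually the case, the values of $\eta_u$ and
$\alpha_u$ coincide, i.e., $\alpha_u \approx \eta_u \approx E\,(1/\|u_i\|^2)$, and we get

\[ \zeta^{\epsilon\text{-NLMS}} = \frac{1}{2 - \mu}\left[ \mu \sigma_v^2 + \frac{\mu^{-1}\mathrm{Tr}(Q)}{E\,(1/\|u_i\|^2)} \right] \quad \text{(when } \epsilon \text{ is small)} \tag{21.20} \]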

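An alternative expression for the EMSE of $\epsilon$-NLMS can be obtained by using the
assumption $\epsilon \approx 0$ in order to initially simplify (21.17) into

\[ \mu\,E\left( \frac{|e_a(i)|^2}{\|u_i\|^2} \right) + \mu \sigma_v^2\,E\left( \frac{1}{\|u_i\|^2} \right) + \mu^{-1}\mathrm{Tr}(Q) = 2\,E\left( \frac{|e_a(i)|^2}{\|u_i\|^2} \right) \]

Then we appeal to the steady-state approximation (17.8), instead of (21.8), to find that

\[ \zeta^{\epsilon\text{-NLMS}} = \frac{\mathrm{Tr}(R_u)}{2 - \mu}\left[ \mu \sigma_v^2\,E\left( \frac{1}{\|u_i\|^2} \right) + \mu^{-1}\mathrm{Tr}(Q) \right] \quad \text{(when } \epsilon \text{ is small)} \tag{21.21} \]

In both cases (21.20)-(21.21), approximate expressions for the optimal step-size can be
obtained by differentiating $\zeta^{\epsilon\text{-NLMS}}$ with respect to $\mu$; see Prob. IV.23.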


Lemma 21.2 (Tracking EMSE of $\epsilon$-NLMS) Consider the $\epsilon$-NLMS recursion (21.15)
and assume $\{d(i), u_i\}$ satisfy the nonstationary model (20.16) with a
sufficiently small degree of nonstationarity. Then, under the separation
assumption (21.8) and for small $\epsilon$, its EMSE can be approximated by

\[ \zeta^{\epsilon\text{-NLMS}} = \frac{1}{2 - \mu}\left[ \mu \sigma_v^2 + \frac{\mu^{-1}\mathrm{Tr}(Q)}{E\,(1/\|u_i\|^2)} \right] \]

or, under the steady-state approximation (17.8),

\[ \zeta^{\epsilon\text{-NLMS}} = \frac{\mathrm{Tr}(R_u)}{2 - \mu}\left[ \mu \sigma_v^2\,E\left( \frac{1}{\|u_i\|^2} \right) + \mu^{-1}\mathrm{Tr}(Q) \right] \]

Optimal choices for the step-size parameter, and the resulting minimum
EMSE for these approximations, are given in Prob. IV.23. The misadjustment
is obtained by dividing the EMSE by $\sigma_v^2$.


21.3 PERFORMANCE OF SIGN-ERROR LMS 


We now examine the tracking performance of the sign-error LMS algorithm,

\[ w_i = w_{i-1} + \mu u_i^*\,\mathrm{csgn}[e(i)] \tag{21.22} \]

for which

\[ g[e(i)] = \mathrm{csgn}[e(i)] = \mathrm{csgn}[e_a(i) + v(i)] \tag{21.23} \]

and the data $\{d(i), u_i, v(i)\}$ are assumed to satisfy model (20.16). As in Chapter 18, we
need to distinguish between the case of real-valued data and complex-valued data.


Real-Valued Data
Assume first that all data are real-valued. In this case,

\[ \mathrm{csgn}[x] = \mathrm{sign}[x] \]

so that

\[ g^2(x) = 1 \]

almost everywhere on the real line, and the variance relation (20.32) becomes

\[ \mu\,\mathrm{Tr}(R_u) + \mu^{-1}\mathrm{Tr}(Q) = 2\,E\,e_a(i)\,\mathrm{sign}[e_a(i) + v(i)], \quad i \to \infty \tag{21.24} \]

Again, except for the term $\mu^{-1}\mathrm{Tr}(Q)$, this identity has the same form as the identity (18.3)
that appeared in our study of the mean-square performance of sign-error LMS in Chapter 18
for stationary environments. Therefore, performing the same expansions that followed
(18.3) and assuming that (cf. (18.4)):

\[ \text{The estimation error } e_a(i) \text{ and the noise } v(i) \text{ are jointly Gaussian} \tag{21.25} \]

we can invoke Price's theorem (18.6) to find that

\[ \mu\,\mathrm{Tr}(R_u) + \mu^{-1}\mathrm{Tr}(Q) = \sqrt{\frac{8}{\pi}} \cdot \frac{E\,e_a^2(i)}{\sqrt{E\,e_a^2(i) + \sigma_v^2}}, \quad \text{as } i \to \infty \tag{21.26} \]

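Solving for $E\,e_a^2(\infty)$, we get

\[ \zeta^{\text{sign-error LMS}} = \frac{\alpha}{2}\left( \alpha + \sqrt{\alpha^2 + 4\sigma_v^2} \right) \tag{21.27} \]

where

\[ \alpha \triangleq \sqrt{\frac{\pi}{8}}\,\Big[ \mu\,\mathrm{Tr}(R_u) + \mu^{-1}\mathrm{Tr}(Q) \Big] \tag{21.28} \]

It can be seen from (21.27) that the EMSE of sign-error LMS increases with the value of $\alpha$
(since the derivative of the EMSE with respect to $\alpha$ is positive for positive $\alpha$). This shows
that the EMSE can be minimized by selecting a step-size that minimizes $\alpha$. Differentiating
the expression for $\alpha$ with respect to $\mu$ and setting the derivative equal to zero we get

\[ \mu_{\mathrm{opt}}^{\text{sign-error LMS}} = \sqrt{\mathrm{Tr}(Q) / \mathrm{Tr}(R_u)} \]

The resulting minimum EMSE is

\[ \zeta_{\min}^{\text{sign-error LMS}} = \frac{\pi}{4}\,\mathrm{Tr}(R_u)\,\mathrm{Tr}(Q) \left( 1 + \sqrt{ 1 + \frac{8\sigma_v^2}{\pi\,\mathrm{Tr}(R_u)\,\mathrm{Tr}(Q)} } \right) \]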
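
The real-data expressions (21.26)-(21.28) are easy to cross-check numerically. The sketch below evaluates the EMSE from (21.27)-(21.28), verifies that it satisfies the fixed-point relation (21.26), and confirms that the optimal step-size attains the stated minimum. The function names and the parameter values are hypothetical.

```python
import math

def sign_error_lms_emse(mu, sigma_v2, tr_Ru, tr_Q):
    """Tracking EMSE of sign-error LMS for real data, expressions (21.27)-(21.28)."""
    alpha = math.sqrt(math.pi / 8.0) * (mu * tr_Ru + tr_Q / mu)
    return 0.5 * alpha * (alpha + math.sqrt(alpha ** 2 + 4.0 * sigma_v2))

def sign_error_lms_optimum(sigma_v2, tr_Ru, tr_Q):
    """Optimal step-size and the resulting minimum EMSE for real data."""
    mu_opt = math.sqrt(tr_Q / tr_Ru)
    p = tr_Ru * tr_Q
    emse_min = 0.25 * math.pi * p * (
        1.0 + math.sqrt(1.0 + 8.0 * sigma_v2 / (math.pi * p)))
    return mu_opt, emse_min

# Assumed example values, for illustration only.
sigma_v2, tr_Ru, tr_Q = 1e-2, 5.0, 5e-6
mu = 0.02

zeta = sign_error_lms_emse(mu, sigma_v2, tr_Ru, tr_Q)

# Both sides of (21.26) evaluated at the computed EMSE.
lhs = mu * tr_Ru + tr_Q / mu
rhs = math.sqrt(8.0 / math.pi) * zeta / math.sqrt(zeta + sigma_v2)

mu_opt, emse_min = sign_error_lms_optimum(sigma_v2, tr_Ru, tr_Q)
print(f"(21.26): lhs = {lhs:.6e}, rhs = {rhs:.6e}")
print(f"mu_opt = {mu_opt:.4e}, minimum EMSE = {emse_min:.4e}")
```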


Complex-Valued Data 
When the data are complex-valued, the sign function is defined by

\[ \mathrm{csgn}[x] \triangleq \mathrm{sign}(x_r) + j\,\mathrm{sign}(x_i) \]

where $x = x_r + j x_i$. It follows now that

\[ |\mathrm{csgn}[x]|^2 = 2 \]

almost everywhere in the complex plane, so that the variance relation (20.32) becomes

\[ 2\mu\,\mathrm{Tr}(R_u) + \mu^{-1}\mathrm{Tr}(Q) = 2\,\mathrm{Re}\big( E\,e_a^*(i)\,\mathrm{csgn}[e_a(i) + v(i)] \big), \quad i \to \infty \tag{21.29} \]

Similarly to the real case, and as was done in Sec. 18.2, we can now invoke Price's theorem
for complex data (from Prob. IV.11) and rewrite (21.29) as follows:

\[ 2\mu\,\mathrm{Tr}(R_u) + \mu^{-1}\mathrm{Tr}(Q) = \sqrt{\frac{8}{\pi}} \cdot \frac{E\,|e_a(i)|^2}{\sqrt{E\,|e_a(i)|^2 + \sigma_v^2}} \]

which has a form similar to (21.26). This equality can now be solved for $E\,|e_a(\infty)|^2$.


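Lemma 21.3 (Tracking EMSE of sign-error LMS) Consider the sign-error LMS
recursion (21.22) and assume the data $\{d(i), u_i\}$ satisfy model (20.16) with
a sufficiently small degree of nonstationarity. Assume also that $\{v(i), u_i\}$
are Gaussian and the step-size is small. Then its EMSE as a function of the
step-size can be approximated by

\[ \zeta^{\text{sign-error LMS}} = \frac{\alpha}{2}\left( \alpha + \sqrt{\alpha^2 + 4\sigma_v^2} \right) \]

where

\[ \alpha \triangleq \sqrt{\frac{\pi}{8}}\,\Big[ \mu \gamma\,\mathrm{Tr}(R_u) + \mu^{-1}\mathrm{Tr}(Q) \Big] \]

with $\gamma = 1$ for real-valued data and $\gamma = 2$ for complex-valued data. The
optimal step-size, and the resulting minimum EMSE, are given by

\[ \mu_{\mathrm{opt}}^{\text{sign-error LMS}} = \sqrt{\frac{\mathrm{Tr}(Q)}{\gamma\,\mathrm{Tr}(R_u)}} \]

\[ \zeta_{\min}^{\text{sign-error LMS}} = \frac{\pi \gamma}{4}\,\mathrm{Tr}(R_u)\,\mathrm{Tr}(Q) \left( 1 + \sqrt{ 1 + \frac{8\sigma_v^2}{\pi \gamma\,\mathrm{Tr}(R_u)\,\mathrm{Tr}(Q)} } \right) \]

The misadjustment is obtained by dividing the EMSE by $\sigma_v^2$.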


21.4 PERFORMANCE OF RLS 


We now examine the tracking performance of the RLS algorithm,

\[ P_i = \lambda^{-1}\left[ P_{i-1} - \frac{\lambda^{-1} P_{i-1} u_i^* u_i P_{i-1}}{1 + \lambda^{-1} u_i P_{i-1} u_i^*} \right] \tag{21.30} \]

\[ w_i = w_{i-1} + P_i u_i^* [d(i) - u_i w_{i-1}], \quad i \geq 0 \tag{21.31} \]

with initial condition $P_{-1} = \epsilon^{-1} I$ and $0 \ll \lambda \leq 1$. The scalar $\epsilon$ is a small
positive number and $\lambda$ is usually close to one. As explained in Chapter 19, we are using a
boldface letter for $P_i$ to indicate that $P_i$ is a random variable due to its dependence on the
$\{u_j\}$. Also, recall from (19.3) that

\[ P_i^{-1} = \lambda^{i+1} \epsilon I + \sum_{j=0}^{i} \lambda^{i-j} u_j^* u_j \tag{21.32} \]

which shows that $P_i > 0$ for all finite $i$.

In a manner similar to Chapter 19, the energy-conservation approach of Sec. 20.3 can
be extended in a straightforward manner to treat this case. Indeed, it is immediate to verify
that the result of Thm. 20.1 becomes

\[ \|w_i^o - w_i\|^2_{P_i^{-1}} + \bar{\mu}(i)\,|e_a(i)|^2 = \|w_i^o - w_{i-1}\|^2_{P_i^{-1}} + \bar{\mu}(i)\,|e_p(i)|^2 \tag{21.33} \]

where $e_a(i) = u_i (w_i^o - w_{i-1})$, $e_p(i) = u_i (w_i^o - w_i)$, and $\bar{\mu}(i)$ is defined as

\[ \bar{\mu}(i) \triangleq \big( \|u_i\|^2_{P_i} \big)^{\dagger} = \begin{cases} 1/\|u_i\|^2_{P_i} & \text{if } u_i \neq 0 \\ 0 & \text{otherwise} \end{cases} \tag{21.34} \]

Comparing (21.33) with (20.24), we see that the main distinction is the appearance of the
weighting matrices $\{P_i^{-1}, P_i\}$: the former is used as a weighting factor for the weight-error
vectors while the latter is used as a weighting factor for the regressors. Recall again
that the notation $\|x\|^2_W$ stands for $\|x\|^2_W = x^* W x$, for a column vector $x$. The same
argument that led to (20.29) will then show that (21.33) leads under expectation to

\[ E\,\|\tilde{w}_i\|^2_{P_i^{-1}} + E\,\bar{\mu}(i)|e_a(i)|^2 = E\,\|\tilde{w}_{i-1}\|^2_{P_i^{-1}} + E\,\|q_i\|^2_{P_i^{-1}} + E\,\bar{\mu}(i)|e_p(i)|^2 \tag{21.35} \]
We are interested in evaluating the tracking performance of RLS in steady-state. For
this purpose, we shall employ the steady-state condition (cf. Def. 15.1),

\[ E\,\tilde{w}_i \tilde{w}_i^* = E\,\tilde{w}_{i-1} \tilde{w}_{i-1}^* = C, \quad \text{as } i \to \infty \tag{21.36} \]

in order to simplify the variance relation (21.35) into a form similar to that in Thm. 20.2.
However, the presence of the matrices $\{P_i, P_i^{-1}\}$ in (21.35) makes this step challenging;
this is because $\{P_i, P_i^{-1}\}$ are dependent not only on $u_i$ but also on all prior regression
vectors, $\{u_j,\ j < i\}$. For this reason, as was done in Chapter 19 and in order to make the
performance analysis of RLS more tractable, whenever necessary, we shall approximate
and replace the random variables $\{P_i^{-1}, P_i\}$ in steady-state by (cf. (19.11)-(19.12)):

\[ E\,P_i^{-1} \to \frac{R_u}{1 - \lambda} \triangleq P^{-1}, \quad \text{as } i \to \infty \tag{21.37} \]

and

\[ E\,P_i \approx \big[ E\,P_i^{-1} \big]^{-1} = (1 - \lambda) R_u^{-1} = P, \quad \text{as } i \to \infty \tag{21.38} \]

These approximations, along with the steady-state condition (21.36), allow us to write
(cf. (19.13)):

\[ E\,\|\tilde{w}_i\|^2_{P^{-1}} = E\,\|\tilde{w}_{i-1}\|^2_{P^{-1}} \tag{21.39} \]

so that (21.35) reduces to

\[ E\,\bar{\mu}(i)|e_a(i)|^2 = E\,\|q_i\|^2_{P^{-1}} + E\,\bar{\mu}(i)|e_p(i)|^2, \quad \text{as } i \to \infty \tag{21.40} \]

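However, from (19.5) we know how $e_p(i)$ is related to $e_a(i)$. Substituting into (21.40) we
get

\[ E\,\bar{\mu}(i)|e_a(i)|^2 = E\,\|q_i\|^2_{P^{-1}} + E\,\bar{\mu}(i)\,\big| e_a(i) - \|u_i\|^2_{P_i}\,e(i) \big|^2, \quad \text{as } i \to \infty \tag{21.41} \]

which upon expansion and simplification reduces to

\[ E\,\|u_i\|^2_{P_i} |e(i)|^2 + E\,\|q_i\|^2_{P^{-1}} = 2\,\mathrm{Re}\big( E\,e_a^*(i)\,e(i) \big), \quad \text{as } i \to \infty \tag{21.42} \]

This variance relation is the extension of (20.32) to the RLS case. Now since the data
$\{d(i), u_i\}$ satisfy model (20.16), we also have that $e(i) = e_a(i) + v(i)$ and (21.42)
becomes

\[ \sigma_v^2\,E\,\|u_i\|^2_{P_i} + E\,\|u_i\|^2_{P_i} |e_a(i)|^2 + E\,\|q_i\|^2_{P^{-1}} = 2\,E\,|e_a(i)|^2 = 2\,\zeta^{\mathrm{RLS}}, \quad i \to \infty \tag{21.43} \]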
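To proceed we resort to the separation condition (16.7), namely, we assume that

\[ \text{At steady-state, } \|u_i\|^2_{P_i} \text{ is independent of } e_a(i) \tag{21.44} \]

This condition allows us to separate the expectation $E\,\|u_i\|^2_{P_i} |e_a(i)|^2$ into the product of
two expectations as $E\,\big( \|u_i\|^2_{P_i} \cdot |e_a(i)|^2 \big) = \big( E\,\|u_i\|^2_{P_i} \big) \cdot \big( E\,|e_a(i)|^2 \big)$.
If we now replace $P_i$ by its mean value, we obtain

\[ E\,\|u_i\|^2_{P_i} \approx E\,\|u_i\|^2_{P} = \mathrm{Tr}(R_u P) = (1 - \lambda) M \tag{21.45} \]

Moreover, since $q_i$ is independent of all regressors, it is also independent of $P_i^{-1}$ so that

\[ E\,\|q_i\|^2_{P_i^{-1}} = E\,\|q_i\|^2_{P^{-1}} = \mathrm{Tr}(Q P^{-1}) = \mathrm{Tr}(Q R_u) / (1 - \lambda) \]

Substituting into (21.43) we find

\[ \zeta^{\mathrm{RLS}} = \frac{\sigma_v^2 (1 - \lambda) M + \frac{1}{1 - \lambda}\mathrm{Tr}(Q R_u)}{2 - (1 - \lambda) M} \tag{21.46} \]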


Lemma 21.4 (Tracking EMSE of RLS) Consider the RLS recursion (21.30)-(21.31)
and assume $\{d(i), u_i\}$ satisfy the nonstationary model (20.16) with
a sufficiently small degree of nonstationarity. Introduce the approximations
(21.38) and (21.45). Then, under the separation condition (21.44), the
tracking EMSE of RLS can be approximated by

\[ \zeta^{\mathrm{RLS}} = \frac{\sigma_v^2 (1 - \lambda) M + \frac{1}{1 - \lambda}\mathrm{Tr}(Q R_u)}{2 - (1 - \lambda) M} \]

The misadjustment is obtained by dividing the EMSE by $\sigma_v^2$. Assuming $(1 - \lambda)$
is small so that $2 - (1 - \lambda) M \approx 2$, the optimal choice of $\lambda$ that results in
minimal EMSE and the minimum EMSE are given by

\[ \lambda_{\mathrm{opt}} = 1 - \frac{1}{\sigma_v}\sqrt{\frac{\mathrm{Tr}(Q R_u)}{M}} \qquad \text{and} \qquad \zeta_{\min}^{\mathrm{RLS}} = \sigma_v \sqrt{M\,\mathrm{Tr}(Q R_u)} \]

21.5 COMPARISON OF TRACKING PERFORMANCE 


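In order to get an appreciation for the tracking behavior of adaptive filters, we compare
in this section the performance of three such filters, namely, LMS, LMF, and LMMN. We
focus on the case of small step-sizes, since the expressions for the EMSE are simpler in
this case. We also assume real-valued data.

First recall from the statement of Lemma 21.1, and from the results of Prob. IV.25 that,
for small step-sizes, the minimum achievable EMSE of the aforementioned three
algorithms are given by

\[ \zeta_{\min}^{\mathrm{LMS}} = \sqrt{\sigma_v^2\,\mathrm{Tr}(R_u)\,\mathrm{Tr}(Q)}, \qquad \zeta_{\min}^{\mathrm{LMF}} = \frac{\sqrt{\xi_v^6\,\mathrm{Tr}(R_u)\,\mathrm{Tr}(Q)}}{3\sigma_v^2}, \qquad \zeta_{\min}^{\mathrm{LMMN}} = \frac{\sqrt{a\,\mathrm{Tr}(R_u)\,\mathrm{Tr}(Q)}}{b} \]

where

\[ a \triangleq \delta^2 \sigma_v^2 + 2\delta\bar{\delta}\,\xi_v^4 + \bar{\delta}^2 \xi_v^6, \qquad b \triangleq \delta + 3\bar{\delta}\sigma_v^2 \]

\[ \bar{\delta} = 1 - \delta, \qquad \xi_v^4 \triangleq E\,|v(i)|^4, \qquad \xi_v^6 \triangleq E\,|v(i)|^6 \]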

In this section, we use the ratio of the minimum achievable EMSE for each of LMF and
LMMN to that of LMS as a performance measure. For LMF, this ratio is equal to

\[ \frac{\zeta_{\min}^{\mathrm{LMS}}}{\zeta_{\min}^{\mathrm{LMF}}} = \frac{3\sigma_v^3}{\sqrt{\xi_v^6}} \tag{21.47} \]

and is seen to depend only on the statistical properties of the noise sequence $v(i)$. For
LMMN, the same ratio is given by

\[ \frac{\zeta_{\min}^{\mathrm{LMS}}}{\zeta_{\min}^{\mathrm{LMMN}}} = \frac{\sigma_v\,b}{\sqrt{a}} \tag{21.48} \]

which is also dependent on the statistical properties of the noise, as well as on the mixing
parameter $\delta$. We specialize these results for some noise distributions.

Gaussian noise. Assume that $v(i)$ is Gaussian. Then $\xi_v^4 = 3\sigma_v^4$ and $\xi_v^6 = 15\sigma_v^6$, so that
(21.47) becomes

\[ \frac{\zeta_{\min}^{\mathrm{LMS}}}{\zeta_{\min}^{\mathrm{LMF}}} = \sqrt{\frac{3}{5}} \approx -1.1\ \text{dB} \]

where for a value $x$, we are using $10\log_{10}(x)$ as its dB equivalent. This result indicates
that the minimum achievable EMSE of LMS is less than that of LMF by approximately
1.1 dB, for any value of the noise variance $\sigma_v^2$.

For the case of LMMN, expression (21.48) yields

\[ \frac{\zeta_{\min}^{\mathrm{LMS}}}{\zeta_{\min}^{\mathrm{LMMN}}} = \frac{b}{\sqrt{\delta^2 + 6\delta\bar{\delta}\sigma_v^2 + 15\bar{\delta}^2\sigma_v^4}} \tag{21.49} \]

and Fig. 21.1 shows a plot of this ratio versus $\delta$ for various values of $\sigma_v^2$. The figure
shows that the ratio is always less than unity for all values of $\delta$ and $\sigma_v^2$, which reflects
the superiority of LMS over LMMN for tracking nonstationary systems in Gaussian noise
environments.


FIGURE 21.1 Comparison of the tracking performance of LMS, LMF, and LMMN for Gaussian noise. The figure plots the ratio (21.49), $\zeta^{\rm LMS}_{\min}/\zeta^{\rm LMMN}_{\min}$, as a function of $\delta$ and $\sigma_v^2$. Panel title: "Tracking performance for different choices of $\delta$ and Gaussian noise"; horizontal axis: parameter $\delta$.

Uniform noise. Assume now that the noise $v(i)$ is uniformly distributed over an interval $[-\Delta, \Delta]$. Then it is easy to verify that $\sigma_v^2 = \Delta^2/3$, $\xi_v^4 = \Delta^4/5$, and $\xi_v^6 = \Delta^6/7$, so that from (21.47) we get

$$\frac{\zeta^{\rm LMS}_{\min}}{\zeta^{\rm LMF}_{\min}} \;=\; \sqrt{\frac{7}{3}} \;\approx\; 1.8\ \text{dB}$$

This result indicates that the minimum achievable EMSE of LMF is now less than that of LMS by approximately 1.8 dB under the $10\log_{10}(x)$ convention used above (equivalently, $10\log_{10}(7/3) \approx 3.7$ dB for the squared ratio). Figure 21.2 shows a plot of the ratio of the minimum achievable EMSEs for LMS and LMMN versus $\delta$ for various values of $\sigma_v^2$. The figure shows that this ratio is now always larger than unity for all values of $\delta$ and $\sigma_v^2$. We can also see that the choice $\delta = 0$ (which corresponds to LMF) results in the best tracking performance. We therefore find that for uniform noise, the LMF algorithm is superior to both LMS and LMMN.
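The uniform-noise comparison can be reproduced with a few lines of code. The sketch below is illustrative only and rests on assumptions that are not restated in this section: real-valued data, with $a = \delta^2\sigma_v^2 + 2\delta(1-\delta)\xi_v^4 + (1-\delta)^2\xi_v^6$ and $b = \delta + 3(1-\delta)\sigma_v^2$ in the ratio $b\,\sigma_v/\sqrt{a}$. The function names are hypothetical.

```python
import math


def uniform_moments(half_width):
    """Second, fourth, and sixth moments of a uniform variable on [-half_width, half_width]."""
    return half_width**2 / 3.0, half_width**4 / 5.0, half_width**6 / 7.0


def lms_over_lmmn(delta, sigma_v2, xi4, xi6):
    """Ratio b * sigma_v / sqrt(a) of the minimum EMSEs of LMS and LMMN.

    Assumed definitions (real data):
      a = delta^2 sigma_v^2 + 2 delta (1 - delta) xi4 + (1 - delta)^2 xi6
      b = delta + 3 (1 - delta) sigma_v^2
    """
    d_bar = 1.0 - delta
    a = delta**2 * sigma_v2 + 2.0 * delta * d_bar * xi4 + d_bar**2 * xi6
    b = delta + 3.0 * d_bar * sigma_v2
    return b * math.sqrt(sigma_v2) / math.sqrt(a)


if __name__ == "__main__":
    s2, x4, x6 = uniform_moments(1.0)
    for k in range(0, 11, 2):
        print(k / 10, round(lms_over_lmmn(k / 10, s2, x4, x6), 4))
```

With these definitions the ratio never drops below one for uniform noise, and its largest value, $\sqrt{7/3}$, occurs at $\delta = 0$ (the LMF case), in line with the discussion above.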


FIGURE 21.2 Comparison of the tracking performance of LMS, LMF, and LMMN for uniform noise. The figure plots the ratio (21.48) as a function of $\delta$ and $\sigma_v^2$. Panel title: "Tracking performance for different choices of $\delta$ and uniform noise"; horizontal axis: parameter $\delta$, from 0 to 1.

Mixed Gaussian and uniform noise. We now consider the case where the noise $v(i)$ is a mixture of Gaussian and uniform distributions (e.g., a mixture of Gaussian system noise and uniformly distributed roundoff errors). In the simulations we generate $v(i)$ as the sum of a Gaussian and a uniform random variable with variance ratio 1 to 3. Figure 21.3 shows the ratio of the minimum achievable EMSE of the LMS and LMMN algorithms versus $\delta$ for different values of the system noise variance $\sigma_v^2$. We see that in this case, the LMMN algorithm has the best tracking performance.


FIGURE 21.3 Comparison of the tracking performance of LMS, LMF, and LMMN for a mixed Gaussian/uniform noise distribution. The figure plots the ratio (21.48) as a function of $\delta$ and $\sigma_v^2$. Panel title: "Tracking performance for different choices of $\delta$ and mixed Gaussian/uniform noise"; horizontal axis: parameter $\delta$, from 0 to 1; vertical axis: $\zeta^{\rm LMS}_{\min}/\zeta^{\rm LMMN}_{\min}$.


21.6 COMPARING RLS AND LMS 


Although the convergence performance of RLS is significantly superior to that of LMS, it does not necessarily follow that the tracking performance of RLS is similarly superior to that of LMS. Actually, there are situations where one algorithm outperforms the other and vice versa, so much so that a general statement about how their tracking behaviors relate to each other is difficult to make.

To illustrate this fact, recall from Lemma 21.1 that the excess mean-square error of LMS in nonstationary environments, and for sufficiently small step-sizes, can be approximated by

$$\zeta^{\rm LMS} \;=\; \frac{\mu\,\sigma_v^2\,\mathrm{Tr}(R_u)}{2} + \frac{\mu^{-1}\,\mathrm{Tr}(Q)}{2}$$


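with the corresponding optimal choice for the step-size and minimum achievable EMSE given by

$$\mu_{\rm opt} = \sqrt{\frac{\mathrm{Tr}(Q)}{\sigma_v^2\,\mathrm{Tr}(R_u)}}, \qquad \zeta^{\rm LMS}_{\min} = \sqrt{\sigma_v^2\,\mathrm{Tr}(R_u)\,\mathrm{Tr}(Q)}$$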


Likewise, from Lemma 21.4 we have that the excess mean-square error of RLS in nonstationary environments, and for forgetting factors that are sufficiently close to one, can be approximated by

$$\zeta^{\rm RLS} \;=\; \frac{\sigma_v^2\,(1-\lambda)\,M}{2} + \frac{1}{2(1-\lambda)}\,\mathrm{Tr}(QR_u)$$


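with the corresponding optimal choice for the forgetting factor and minimum achievable EMSE given by

$$\lambda_{\rm opt} = 1 - \sqrt{\frac{\mathrm{Tr}(QR_u)}{\sigma_v^2\,M}}, \qquad \zeta^{\rm RLS}_{\min} = \sqrt{\sigma_v^2\,M\,\mathrm{Tr}(QR_u)}$$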


It follows that

$$\frac{\zeta^{\rm RLS}_{\min}}{\zeta^{\rm LMS}_{\min}} \;=\; \sqrt{\frac{M\,\mathrm{Tr}(QR_u)}{\mathrm{Tr}(R_u)\,\mathrm{Tr}(Q)}}$$


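from which it is seen that the performance of RLS and LMS will be similar whenever $R_u$ or $Q$ is a multiple of the identity matrix. For other choices of $\{Q, R_u\}$, one algorithm may perform better than the other. Three examples for different choices of $Q$ are listed in Table 21.1: (i) $Q = \sigma_q^2 I$, i.e., the covariance matrix of the random perturbation vector $q_i$ is a multiple of the identity; (ii) $Q$ is a multiple of $R_u$; and (iii) $Q$ is a multiple of $R_u^{-1}$. It is seen from the results in the table that the performance of LMS is similar to that of RLS in case (i), while LMS is superior in case (ii), and RLS is superior in case (iii). The conclusions in cases (ii) and (iii) follow from the fact that for any $M \times M$ positive-definite matrix $R_u$, it always holds that (see Prob. IV.28):

$$[\mathrm{Tr}(R_u)]^2 \le M\,\mathrm{Tr}(R_u^2) \quad \text{and} \quad M^2 \le \mathrm{Tr}(R_u)\,\mathrm{Tr}(R_u^{-1}) \qquad (\text{for any } R_u > 0)$$

Of course, these results on the tracking performance of RLS and LMS assume filter operation in environments with a small degree of nonstationarity.

TABLE 21.1 Minimum achievable excess mean-square error of LMS and RLS (i.e., $\zeta^{\rm LMS}_{\min}$ and $\zeta^{\rm RLS}_{\min}$) for three choices of the nonstationary covariance matrix $Q$ in comparison to the regressor covariance matrix $R_u$. In the table, $\sigma_q^2$ is some constant value.

$Q$                      $\zeta^{\rm LMS}_{\min}$                                              $\zeta^{\rm RLS}_{\min}$                              $\zeta^{\rm RLS}_{\min}/\zeta^{\rm LMS}_{\min}$                   Comparison
$\sigma_q^2 I$           $\sigma_v\sigma_q\sqrt{M\,\mathrm{Tr}(R_u)}$                          $\sigma_v\sigma_q\sqrt{M\,\mathrm{Tr}(R_u)}$          $1$                                                               similar performance
$\sigma_q^2 R_u$         $\sigma_v\sigma_q\,\mathrm{Tr}(R_u)$                                  $\sigma_v\sigma_q\sqrt{M\,\mathrm{Tr}(R_u^2)}$        $\sqrt{M\,\mathrm{Tr}(R_u^2)}/\mathrm{Tr}(R_u)$                   LMS is superior
$\sigma_q^2 R_u^{-1}$    $\sigma_v\sigma_q\sqrt{\mathrm{Tr}(R_u)\,\mathrm{Tr}(R_u^{-1})}$      $\sigma_v\sigma_q\,M$                                 $M/\sqrt{\mathrm{Tr}(R_u)\,\mathrm{Tr}(R_u^{-1})}$                RLS is superior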
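The three cases can be checked numerically from the small-step-size expressions of Lemmas 21.1 and 21.4 quoted in this section. The sketch below is only an illustration: the regressor covariance, noise variance, and scaling constant are hypothetical values chosen for the example, and the function names are not from the text.

```python
import numpy as np


def emse_lms(mu, sigma_v2, Ru, Q):
    """Small-step-size tracking EMSE of LMS."""
    return 0.5 * mu * sigma_v2 * np.trace(Ru) + 0.5 * np.trace(Q) / mu


def emse_rls(lam, sigma_v2, Ru, Q):
    """Tracking EMSE of RLS for forgetting factors close to one."""
    M = Ru.shape[0]
    return 0.5 * sigma_v2 * (1.0 - lam) * M + 0.5 * np.trace(Q @ Ru) / (1.0 - lam)


def min_emse_lms(sigma_v2, Ru, Q):
    return np.sqrt(sigma_v2 * np.trace(Ru) * np.trace(Q))


def min_emse_rls(sigma_v2, Ru, Q):
    return np.sqrt(sigma_v2 * Ru.shape[0] * np.trace(Q @ Ru))


if __name__ == "__main__":
    sigma_v2, c = 1e-2, 1e-6            # hypothetical noise variance and scaling
    Ru = np.diag([4.0, 1.0, 0.25])      # hypothetical regressor covariance
    cases = {
        "Q = c I": c * np.eye(3),
        "Q = c Ru": c * Ru,
        "Q = c inv(Ru)": c * np.linalg.inv(Ru),
    }
    for name, Q in cases.items():
        ratio = min_emse_rls(sigma_v2, Ru, Q) / min_emse_lms(sigma_v2, Ru, Q)
        print(f"{name:14s} RLS/LMS = {ratio:.3f}")
```

A ratio above one favors LMS and a ratio below one favors RLS, so the sign of the comparison is set entirely by how $Q$ aligns with $R_u$.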


21.7 PERFORMANCE OF OTHER FILTERS 


The same line of reasoning used in the previous sections is also used in the problems, and in Part V (Transient Performance), to study the tracking performance of other adaptive filters. For ease of reference, we list here the locations in the text that study these filters.

1. $\epsilon$-NLMS with power normalization. The algorithm is described by the recursions:

$$p(i) = \beta\,p(i-1) + (1-\beta)\,|u(i)|^2, \qquad p(-1) = 0 \tag{21.50}$$

$$w_i = w_{i-1} + \frac{\mu}{\epsilon + p(i)}\,u_i^*\,[d(i) - u_i w_{i-1}] \tag{21.51}$$

where $\beta$ is a positive number in the interval $0 < \beta < 1$. The tracking performance of the algorithm is studied in Prob. IV.24 for Gaussian regressors with shift structure by extending the arguments of Sec. 21.2.


2. LMF and LMMN. The LMMN algorithm is described by (cf. Alg. 12.4):

$$w_i = w_{i-1} + \mu\,u_i^*\,e(i)\left[\delta + (1-\delta)|e(i)|^2\right], \qquad 0 \le \delta \le 1 \tag{21.52}$$

which employs a linear combination of $e(i)$ and $e(i)|e(i)|^2$. The least-mean-fourth (LMF) algorithm corresponds to the special case $\delta = 0$, i.e., $w_i = w_{i-1} + \mu\,u_i^*\,e(i)|e(i)|^2$. The tracking performance of these algorithms is studied in Prob. IV.25.


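3. $\epsilon$-APA. The $\epsilon$-APA algorithm is described by the recursions (cf. Sec. 13.1):

$$U_i = \begin{bmatrix} u_i \\ u_{i-1} \\ \vdots \\ u_{i-K+1} \end{bmatrix}, \qquad d_i = \begin{bmatrix} d(i) \\ d(i-1) \\ \vdots \\ d(i-K+1) \end{bmatrix} \tag{21.53}$$

$$e_i = d_i - U_i w_{i-1} \tag{21.54}$$

$$w_i = w_{i-1} + \mu\,U_i^*\,(\epsilon I + U_i U_i^*)^{-1}\,e_i, \qquad i \ge 0 \tag{21.55}$$

where $\epsilon$ is a small positive number and $K \le M$. The energy conservation approach of Sec. 20.3 is extended to this case and the tracking performance of the algorithm is studied in Prob. IV.26.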


4. Leaky-LMS. The leaky-LMS filter is described by the recursion (see Alg. 12.2):

$$w_i = (1 - \mu\alpha)\,w_{i-1} + \mu\,u_i^*\,[d(i) - u_i w_{i-1}], \qquad i \ge 0$$

which looks similar to the LMS recursion (21.1) except for the presence of the factor $\alpha$. The tracking performance of this algorithm is studied in Prob. V.34 for Gaussian regressors.


5. CMA. The CMA2-2 algorithm is described by the recursion

$$w_i = w_{i-1} + \mu\,u_i^*\,z(i)\left[\gamma - |z(i)|^2\right], \qquad z(i) = u_i w_{i-1}, \quad i \ge 0$$

where $\gamma$ is a positive scalar (see Alg. 12.6). The tracking performance of CMA2-2 is studied in Probs. IV.34 and IV.35 for both real and complex-valued data. Problem IV.36 extends the results to CMA1-2, which is described by the recursion (see Alg. 12.5):

$$w_i = w_{i-1} + \mu\,u_i^*\left[\gamma\,\frac{z(i)}{|z(i)|} - z(i)\right], \qquad z(i) = u_i w_{i-1}, \quad i \ge 0$$
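The recursions listed above translate directly into single-step update functions. The following is a minimal sketch for real-valued data only (so complex conjugation is dropped), with the regressor $u_i$ stored as a one-dimensional NumPy array; the function names and the convention that `u[0]` holds the newest input sample $u(i)$ are assumptions made for illustration.

```python
import numpy as np


def power_nlms_step(w, p, u, d, mu, eps, beta):
    """One step of eps-NLMS with power normalization; returns (w_i, p(i))."""
    p = beta * p + (1.0 - beta) * u[0] ** 2   # u[0] assumed to hold u(i)
    e = d - u @ w
    return w + (mu / (eps + p)) * u * e, p


def lmmn_step(w, u, d, mu, delta):
    """One LMMN step; delta = 1 gives LMS and delta = 0 gives LMF."""
    e = d - u @ w
    return w + mu * u * e * (delta + (1.0 - delta) * e ** 2)


def leaky_lms_step(w, u, d, mu, alpha):
    """One leaky-LMS step; alpha = 0 gives LMS."""
    e = d - u @ w
    return (1.0 - mu * alpha) * w + mu * u * e


def cma22_step(w, u, mu, gamma):
    """One CMA2-2 step (blind update, no reference signal d)."""
    z = u @ w
    return w + mu * u * z * (gamma - z ** 2)


if __name__ == "__main__":
    w = np.array([0.1, 0.2, -0.3])
    u = np.array([1.0, -2.0, 0.5])
    print(lmmn_step(w, u, 0.7, 0.01, 0.5))
    print(leaky_lms_step(w, u, 0.7, 0.01, 0.1))
    print(cma22_step(w, u, 0.01, 1.0))
```

Writing each filter as a pure function of the previous weight vector keeps the special cases ($\delta = 1$, $\delta = 0$, $\alpha = 0$) easy to compare against the plain LMS update.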


21.8 PERFORMANCE TABLE FOR SMALL STEP-SIZES 


We find it useful to list in Table 21.2 the tracking EMSE of several adaptive filters under 
the assumption of sufficiently small step-sizes. The expressions in the table are obtained as 
approximations from the results derived in the chapter and they serve as convenient refer- 
ences for comparing algorithms. The table also lists points of reference for each algorithm. 


TABLE 21.2 Approximate expressions for the excess mean-square performance of adaptive filters 
in nonstationary environments for small step-sizes. 


Algorithm EMSE Reference 


LMS    $\frac{\mu\sigma_v^2}{2}\,\mathrm{Tr}(R_u) + \frac{\mu^{-1}}{2}\,\mathrm{Tr}(Q)$    Lemma 21.1


1 


Hoy H 

e-NLMS 2 UE ud 4 2— ree) HQ) Lemma 21.2 

u(1 + 8)Moi + wo *you(1 — 8)Tr(Q) 
-NLMS Sa led NM BTA SR RR STE Problem IV.24 

i 2y(1 - 8) - uM + 8) 

with power 

normalization 

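sign-error LMS    $\frac{\alpha}{2}\left(\alpha + \sqrt{\alpha^2 + 4\sigma_v^2}\right)$, where $\alpha = \sqrt{\frac{\pi}{8}}\left[\mu\,\mathrm{Tr}(R_u) + \mu^{-1}\,\mathrm{Tr}(Q)\right]$    Lemma 21.3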
LMF ues Tr(Ru) + E TQ) Problem IV.25 
202 202 

LMMN Bar (R.) 44 THQ) Problem IV.25 

a Se ee ee 
leaky-LMS    $\frac{\mu\sigma_v^2}{2}\,\mathrm{Tr}\left[R_u^2(R_u + \alpha I)^{-1}\right] + \frac{\mu^{-1}}{2}\,\mathrm{Tr}\left[QR_u(R_u + \alpha I)^{-1}\right]$    Problem V.34
c-APA HO TR, )E + LO T(R.)T(Q) | Problem 1V.26 
| 2-p 2-u 
RLS    $\dfrac{\sigma_v^2(1-\lambda)M + \frac{1}{1-\lambda}\,\mathrm{Tr}(QR_u)}{2 - (1-\lambda)M}$    Lemma 21.4


CMA2-2    $\dfrac{\mu\,E\left(\gamma^2|s|^2 - 2\gamma|s|^4 + |s|^6\right)\mathrm{Tr}(R_u) + \mu^{-1}\,\mathrm{Tr}(Q)}{2\,E\left(2|s|^2 - \gamma\right)}$    Problem IV.35


-1 
CMAI-2 5 (7? + Els? — 27E |s|) Tr(Ru) + E THQ) Problem IV.36 
Summary and Notes


The chapters in this part describe a procedure for evaluating the mean-square and track- 
ing performance of adaptive filters by relying on energy conservation arguments. Some of 
the key results in the chapters are reproduced here for ease of reference. 


SUMMARY OF MAIN RESULTS 


Stationary Environments 
1. Consider adaptive filters that are described by stochastic difference equations of the form

$$w_i = w_{i-1} + \mu\,u_i^*\,g[e(i)], \qquad w_{-1} = \text{initial condition}$$

where $g[\cdot]$ is some function of $e(i) = d(i) - u_i w_{i-1}$. The mean-square error (MSE) of any such filter is defined as the limiting value of $E|e(i)|^2$,

$$\text{MSE} = \lim_{i\to\infty} E|e(i)|^2$$


2. When the data $\{d(i), u_i\}$ satisfy the stationary model (15.16), then the MSE is also given by

$$\text{MSE} = \sigma_v^2 + \lim_{i\to\infty} E|e_a(i)|^2$$

where $e_a(i) = u_i\tilde{w}_{i-1}$ and $\tilde{w}_i = w^o - w_i$. The limiting variance of the a priori error $e_a(i)$ is called the excess mean-square error (EMSE) of the filter,

$$\text{EMSE} = \lim_{i\to\infty} E|e_a(i)|^2$$


3. For any such adaptive filter, and for any data $\{d(i), u_i\}$, the following energy conservation relation holds for all $i$:

$$\|\tilde{w}_i\|^2 + \bar{\mu}(i)\,|e_a(i)|^2 = \|\tilde{w}_{i-1}\|^2 + \bar{\mu}(i)\,|e_p(i)|^2$$

where $e_p(i) = u_i\tilde{w}_i$ and $\bar{\mu}(i)$ is defined as in (15.31).
4. The energy relation is useful in several respects. For the purposes of steady-state analysis, we noted that by taking expectations of both of its sides, and by using the fact that in steady-state $E\|\tilde{w}_i\|^2 = E\|\tilde{w}_{i-1}\|^2$, we arrived at the variance relation:

$$\mu\,E\,\|u_i\|^2\,|g[e(i)]|^2 = 2\,\mathrm{Re}\left(E\,e_a^*(i)\,g[e(i)]\right), \quad \text{as } i \to \infty$$

For real-valued data, this relation becomes

$$\mu\,E\,\|u_i\|^2\,g^2[e(i)] = 2\,E\,e_a(i)\,g[e(i)], \quad \text{as } i \to \infty$$

5. The variance relation was used to derive the EMSE of several adaptive filters, including LMS, $\epsilon$-NLMS, sign-error LMS, and RLS. For RLS, we showed how to incorporate weighting into the energy conservation relation. In the problems at the end of this part, and also in Part V (Transient Performance), we evaluate the EMSE of other adaptive filters, e.g., LMF, LMMN, $\epsilon$-APA, leaky-LMS, sign-regressor LMS, constrained LMS, CMA2-2, and CMA1-2.
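Because the energy conservation relation of item 3 is an exact algebraic identity, it can be verified sample by sample in a simulation. The sketch below does so for real-valued data and for two choices of $g[\cdot]$; it assumes $\bar{\mu}(i) = 1/\|u_i\|^2$ for nonzero regressors, which is how the definition referenced as (15.31) is interpreted here, and all numerical values and names are hypothetical.

```python
import numpy as np


def check_energy_relation(g, mu=0.05, M=5, n_iter=200, seed=0):
    """Largest violation of the energy relation along one simulated run."""
    rng = np.random.default_rng(seed)
    w_o = rng.standard_normal(M)          # unknown (stationary) weight vector
    w = np.zeros(M)
    worst = 0.0
    for _ in range(n_iter):
        u = rng.standard_normal(M)        # regressor (row vector)
        d = u @ w_o + 0.1 * rng.standard_normal()
        e = d - u @ w
        w_new = w + mu * u * g(e)         # generic update w_i = w_{i-1} + mu u_i^T g[e(i)]
        mu_bar = 1.0 / (u @ u)            # assumed definition of mu_bar(i)
        e_a = u @ (w_o - w)               # a priori error
        e_p = u @ (w_o - w_new)           # a posteriori error
        lhs = np.sum((w_o - w_new) ** 2) + mu_bar * e_a ** 2
        rhs = np.sum((w_o - w) ** 2) + mu_bar * e_p ** 2
        worst = max(worst, abs(lhs - rhs))
        w = w_new
    return worst


if __name__ == "__main__":
    print("LMS:           ", check_energy_relation(lambda e: e))
    print("sign-error LMS:", check_energy_relation(np.sign))
```

The violation stays at the level of floating-point roundoff for any error nonlinearity, which reflects the fact that no approximation or statistical assumption enters the relation.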


Nonstationary Environments 


1. When the data $\{d(i), u_i\}$ satisfy the nonstationary model (20.16), then the MSE is still given by

$$\text{MSE} = \sigma_v^2 + \lim_{i\to\infty} E|e_a(i)|^2$$

where now $e_a(i) = u_i(w_i^o - w_{i-1})$. The limiting variance of the a priori error $e_a(i)$ is called the excess mean-square error (EMSE) of the filter,

$$\text{EMSE} = \lim_{i\to\infty} E|e_a(i)|^2$$


2. For any such adaptive filter, and for any data $\{d(i), u_i\}$, the following energy conservation relation holds for all $i$:

$$\|w_i^o - w_i\|^2 + \bar{\mu}(i)\,|e_a(i)|^2 = \|w_i^o - w_{i-1}\|^2 + \bar{\mu}(i)\,|e_p(i)|^2$$

where $e_p(i) = u_i\tilde{w}_i$, $\tilde{w}_i = w_i^o - w_i$, and $\bar{\mu}(i)$ is defined as in Thm. 20.1.


3. By taking expectations of both sides of the energy conservation relation, and by using the fact that in steady-state $E\|\tilde{w}_i\|^2 = E\|\tilde{w}_{i-1}\|^2$, we arrived at the following variance relation for nonstationary environments:

$$\mu\,E\,\|u_i\|^2\,|g[e(i)]|^2 + \mu^{-1}\,\mathrm{Tr}(Q) = 2\,\mathrm{Re}\left(E\,e_a^*(i)\,g[e(i)]\right), \quad \text{as } i \to \infty$$

For real-valued data, this relation becomes

$$\mu\,E\,\|u_i\|^2\,g^2[e(i)] + \mu^{-1}\,\mathrm{Tr}(Q) = 2\,E\,e_a(i)\,g[e(i)], \quad \text{as } i \to \infty$$


4. The variance relation was used to derive the EMSE of several adaptive filters in nonstationary environments: LMS, $\epsilon$-NLMS, sign-error LMS, and RLS. For RLS, we showed how to incorporate weighting into the energy conservation relation. In the problems at the end of this part, and also in Part V (Transient Performance), we evaluate the EMSE of other adaptive filters, e.g., LMF, LMMN, $\epsilon$-APA, leaky-LMS, CMA2-2, and CMA1-2.


5. In Probs. IV.29-IV.33 we use the variance relation to study the tracking performance of adaptive filters under a more general nonstationary model than (20.16). Specifically, we assume that the data $\{d(i), u_i, w_i^o\}$ are such that

$$d(i) = u_i w_i^o e^{j\Omega i} + v(i), \qquad w_i^o = w^o + \theta_i, \qquad \theta_i = \alpha\,\theta_{i-1} + q_i, \qquad 0 \le |\alpha| < 1$$

That is, we assume $w_i^o$ undergoes random variations around its mean, $w^o$, through a first-order auto-regressive model with a pole at $\alpha$. In addition, we incorporate a parameter $\Omega$ to model possible frequency offsets between transmitters and receivers.
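As a worked illustration of how the nonstationary variance relation of item 3 is used, the following sketch specializes it to LMS with real-valued data, i.e., $g[e] = e$. It assumes that $e(i) = e_a(i) + v(i)$ with $v(i)$ zero-mean and independent of the regressors, and that $\|u_i\|^2$ is independent of $e_a(i)$ in steady state (the separation principle); $\zeta$ denotes the EMSE.

```latex
\begin{align*}
\mu\,E\,\|u_i\|^2 e^2(i) + \mu^{-1}\,\mathrm{Tr}(Q) &= 2\,E\,e_a(i)\,e(i),
   \qquad e(i) = e_a(i) + v(i) \\
\mu\,\mathrm{Tr}(R_u)\,(\zeta + \sigma_v^2) + \mu^{-1}\,\mathrm{Tr}(Q) &\approx 2\,\zeta \\
\zeta &\approx \frac{\mu\,\sigma_v^2\,\mathrm{Tr}(R_u) + \mu^{-1}\,\mathrm{Tr}(Q)}{2 - \mu\,\mathrm{Tr}(R_u)}
   \;\approx\; \frac{\mu\,\sigma_v^2}{2}\,\mathrm{Tr}(R_u) + \frac{\mu^{-1}}{2}\,\mathrm{Tr}(Q)
   \quad (\text{small } \mu)
\end{align*}
```

The last approximation is the small-step-size expression used for LMS in Sec. 21.6.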


Energy Conservation 
Some of the features of the energy conservation approach used in this part are the following: 


a) It permits the evaluation of steady-state and tracking results without requiring a preliminary transient analysis. In other words, it does not obtain steady-state and tracking results as the limiting case of a transient analysis. In this way, steady-state results are not restricted by the same assumptions that are usually required for a successful transient analysis (see Part V).

b) It does not require knowledge of the weight error covariance matrix since it eliminates the effect of the terms $E\|\tilde{w}_i\|^2$ and $E\|\tilde{w}_{i-1}\|^2$.


c) Since it is easier to manipulate variables algebraically than under expectations, the energy 
conservation approach facilitates the steady-state analysis by eliminating unnecessary cross- 
terms. 


d) The energy conservation relation admits several interpretations (geometric and system-theoretic), and it also relates to Snell's law for light propagation; see App. 15.A.


Variance Relation 

The variance relation can be solved for different adaptive algorithms, which correspond to different choices of the function $g[\cdot]$, in order to evaluate the corresponding EMSE, i.e., $E|e_a(\infty)|^2$. In the process of this computation, we usually need to rely on some assumptions in order to simplify the evaluation of some expectations, e.g.,

a) A small step-size assumption.
b) A separation principle: at steady-state, $\|u_i\|^2$ is independent of $e_a(i)$.
c) A Gaussian assumption, namely, assuming the regressors $u_i$ are Gaussian.


d) The independence assumptions, as described in Sec. 16.4. We did not rely on these assump- 
tions but used instead the separation principle of part (b) above. 


BIBLIOGRAPHIC NOTES 
Stationary Environments 


Energy-conservation. The energy-conservation relation (15.32) was originally derived by Sayed 
and Rupp (1995) and subsequently used by the authors in a series of works, including Rupp and 
Sayed (1996ab, 1997,1998,2000) and Sayed and Rupp (1996,1997,1998), in their studies on the ro- 
bustness and small gain analysis of adaptive filters. These robustness results will be discussed later 
in Chapter 44. In this part, we focused on showing how the energy-conservation relation can be 
used to study the steady-state mean-square performance of adaptive filters in a unified manner. The 
presentation in Chapters 15-19 follows mainly Yousef and Sayed (1999a,2000a,2001), where the 
variance relation (15.40) was derived from the energy-conservation relation (15.32), as well as Mai 
and Sayed (2000) and Shin and Sayed (2004). The last two references study constant-modulus and 
affine projection algorithms. 


Performance results. The steady-state performance of LMS and its variants has been studied 
extensively in the literature, using a variety of approaches: 
1. Expression (16.6) is the same result obtained by Jones, Cavin, and Reed (1982) for the per- 
formance of LMS for small step-sizes. 


2. Expression (16.10) is the same result obtained by Gardner (1984) for LMS. 


3. Expression (17.9) is the same result obtained by Slock (1993) for the performance of NLMS 
assuming a particular model for the regression data (as explained later in Prob. V.20). 


4. Expression (18.8) for the performance of sign-error LMS is the same result obtained by Math- 
ews and Cho (1987) by using, in addition to the Gaussian assumption (18.4), the independence 
assumptions (i)-(vi) of Sec. 16.4. The derivation in Chapter 18 did not use the independence 
assumptions. 


5. The performance of LMMN for small step-sizes from Table 19.1 is the same result obtained 
by Tanrikulu and Chambers (1996) using averaging analysis (averaging methods are discussed 
in App. 24.A). The specialization to LMF, by setting 6 = 0, is the same result obtained by 
Walach and Widrow (1984) by using the independence assumptions. 


EMSE of adaptive filters. In Chapters 16-19 we derived the EMSE for several particular choices of $g[\cdot]$ in (15.23). In Prob. 6.7 of Sayed (2003), and following the work of Al-Naffouri and Sayed (2001b,2003b), we derive a general expression for the EMSE of adaptive filters corresponding to choices of $g[\cdot]$ that are solely functions of $e(\cdot)$. We do so by appealing to a Gaussian assumption on the distribution of $e_a(i)$, and by using this assumption to rewrite the variance relation in an alternative form. More specifically, it is shown in the aforementioned problem that if we introduce the functions

$$h_U \triangleq E\,|g[e(i)]|^2 \qquad \text{and} \qquad h_G \triangleq \frac{\mathrm{Re}\left(E\,e_a^*(i)\,g[e(i)]\right)}{E\,|e_a(i)|^2}$$

then $\{h_U, h_G\}$ are solely functions of $E|e_a(i)|^2$ and, moreover, the variance relation can be rewritten as

$$\zeta = \frac{\mu}{2}\,\mathrm{Tr}(R_u)\,\frac{h_U(\zeta)}{h_G(\zeta)}$$

in terms of the EMSE, $\zeta$. Thus, given $g[\cdot]$, we can evaluate the functions $h_U$ and $h_G$ under the assumed conditions, and then proceed to find a fixed point of the above nonlinear equation in $\zeta$. Problems 6.8-6.15 of Sayed (2003) illustrate this method of computation for several adaptive filters, and show that its results are consistent with the results obtained in this part.
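The fixed-point computation described above can be sketched numerically. The example below reads the rewritten variance relation as $\zeta = \frac{\mu}{2}\mathrm{Tr}(R_u)\,h_U(\zeta)/h_G(\zeta)$ and applies it to LMS with real data, for which $g[e] = e$ gives $h_U(\zeta) = \zeta + \sigma_v^2$ and $h_G(\zeta) = 1$ when the noise is independent of $e_a(i)$. The solver name, the plain iteration scheme, and the numerical values are illustrative assumptions, not the procedure of the cited problems.

```python
def solve_emse(h_U, h_G, mu, tr_Ru, zeta0=0.0, n_iter=100):
    """Iterate zeta <- (mu / 2) Tr(R_u) h_U(zeta) / h_G(zeta)."""
    zeta = zeta0
    for _ in range(n_iter):
        zeta = 0.5 * mu * tr_Ru * h_U(zeta) / h_G(zeta)
    return zeta


if __name__ == "__main__":
    sigma_v2, mu, tr_Ru = 1e-2, 0.01, 5.0          # hypothetical values
    zeta = solve_emse(lambda z: z + sigma_v2,      # h_U for LMS
                      lambda z: 1.0,               # h_G for LMS
                      mu, tr_Ru)
    closed_form = mu * sigma_v2 * tr_Ru / (2.0 - mu * tr_Ru)
    print(zeta, closed_form)
```

For LMS the iteration is a contraction with factor $\mu\,\mathrm{Tr}(R_u)/2$, so it converges whenever $\mu\,\mathrm{Tr}(R_u) < 2$, and its limit is $\mu\sigma_v^2\mathrm{Tr}(R_u)/(2 - \mu\mathrm{Tr}(R_u))$. Other choices of $g[\cdot]$ may need a more careful root-finding method.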


Price’s theorem. This theorem, which is due to Price (1958), plays a useful role in parts of our 
analysis, especially when the error signal is assumed to be Gaussian and/or the filter is assumed to 
be of sufficient length. The theorem can be found in Price (1958) and also in Papoulis (1991). It 
is further studied in Probs. IV.10-IV.12. In the last problem, a special case known as Bussgang’s 
theorem is derived (Bussgang (1952)). In Prob. V.26, the extension of Price’s theorem to complex 
data is considered (cf. McGee (1969) and van den Bos (1996)). 


Relating NLMS and LMS. In App. 17.A we relate $\epsilon$-NLMS to LMS via the change of variables (17.10). This transformation allows us to derive results for $\epsilon$-NLMS from those for LMS. Although this change of variables has been used before in the literature, e.g., by Widrow and Lehr (1990) and An, Brown and Harris (1997), these earlier analyses have some limitations: (1) they consider the case $\epsilon = 0$; (2) the conditions for stability are based on results valid for LMS with Gaussian regressors, while it is clear from (17.10) that the transformed regressor cannot be Gaussian (since it is bounded); and (3) no attention has been given to the fact that the transformed regressor and noise variables are still orthogonal random variables (in which case, it can be verified that NLMS still computes unbiased estimates). These issues are resolved in App. 17.A.


Convex combination of adaptive filters. Combinations of adaptive filters provide one useful 
way to improve adaptive filter performance, whereby the outputs of several filters are mixed together 
to get an overall output of improved quality (see, e.g., Anderson (1985), Niedzwiecki (1992), and 
Singer and Feder (1999)). Clearly, the issue of how to optimally combine the component filters is 
a challenging task. In the work by Arenas-Garcia, Figueiras-Vidal, and Sayed (2006), the mean- 
square performance of a particular convex combination of two transversal filters is studied by using 
the energy conservation arguments of this part. Performance expressions are derived that indicate 
that the method is universal with respect to the component filters, i.e., in steady-state, it performs 
at least as well as the best component filter. The analysis also suggests combination structures with 
improved tracking performance; see Prob. IV.4 for a special case. 


Adaptive networks. Studies on distributed adaptive processing where filters interact with each 
other over both time and space in the context of adaptive networks appear in Lopes and Sayed 
(2006,2007a,b, 2008), Sayed and Lopes (2007), and Cattivelli, Lopes, and Sayed (2008). These 
references employ the same energy conservation arguments of this part to analyze the effect of tem- 
poral and spatial interaction on distributed adaptive filters; see also Prob. V.13. 


Colored noise and nonlinear effects. Recall that LMS is derived as a stochastic-gradient approximation for solving the normal equations (8.4), which characterize the solution to the linear least-mean-squares estimation problem (8.1). In Chapter 16, we studied how well LMS is able to approximate the optimal solution of the normal equations by relying on the stationary data model (15.16). A special feature of this model is that the noise sequence $\{v(i)\}$, which corresponds to the optimal residual that results from estimating $d(i)$ from $u_i$, is assumed to be a white noise sequence. Moreover, $v(i)$ is also assumed to be independent of the regression data $u_j$ for all $i, j$. In the work by Reuter and Zeidler (1999), the authors consider a special example in the context of channel equalization whereby the noise sequence $\{v(i)\}$ is colored. Specifically, $v(i)$ consists of a narrowband signal embedded in white noise. In addition, the noise and regression sequences $\{v(i), u_i\}$ are highly correlated. Clearly, in this scenario, the analysis that we have given for the MSE performance of LMS in the text should be adjusted accordingly. For instance, in Reuter and Zeidler (1999) it was shown that, under such conditions on $\{v(i), u_i\}$, the mean-square performance of LMS can even be superior to that of the normal-equations solution. One justification for this observation is that LMS processes the data in a nonlinear fashion. In other words, LMS is in effect a nonlinear filter and this property can lead to improved performance in situations with highly correlated data and noise.


Ergodic approximations for RLS. The approximation (19.12) used for RLS is common in the lit- 
erature — see, for instance, Eleftheriou and Falconer (1986), Haykin (2000, p. 648), and Manolakis, 
Ingle, and Kogon (2000, p. 557). 


Constant-modulus algorithms. The steady-state performance of constant-modulus algorithms 
is studied in Probs. IV.16-IV.18 using the same energy-conservation arguments that we employed in 
the body of the chapter. The derivation in these problems is based on the work by Mai and Sayed 
(2000). Some of the earlier performance results for constant-modulus schemes appear in the works 
by Chan and Shynk (1990), Bershad and Roy (1990), Shynk et al. (1991), Li and Ding (1996), Zeng 
and Tong (1997), and Fijalkow, Manlove, and Johnson (1998). The article by Johnson et al. (1998) 
provides a comprehensive list of additional references on different aspects of CM algorithms. The 
work by Shynk et al. (1991) gives some of the earliest approximations for the mean-square error 
of CMA2-2 under the assumption of Gaussian regression vectors. The work by Bershad and Roy 
(1990) is also an early work on the performance of CMA2-2 albeit for a particular class of input 
signals that are modeled by Rayleigh fading sinusoids. The work by Zeng and Tong (1997) studies 
the mean-square-error of the optimal CM receiver but the effects of adaptation and gradient noise 
are not considered. The work by Fijalkow, Manlove, and Johnson (1998) obtains an expression for 
the mean-square error of CMA2-2 using Lyapunov stability and averaging analysis arguments. Their 
MSE expression is the closest to the results in Probs. IV.16 and IV.17. 


An application to echo cancellation. There are two types of echoes in communications sys- 
tems: line echoes and acoustic echoes. Line echoes in voice communications occur over telephone 
lines due to circuit imperfections and impedance mismatches (see, e.g., Sondhi and Berkley (1980)). 
Among the many techniques that have been developed over the years to control the echoes (including 
echo suppressors), adaptive echo cancellation seems to be the most effective way and is the method 
of choice in modern implementations. It was first reported in Sondhi (1967); actually, in his arti- 
cle, Sondhi recognizes J. L. Kelly Jr. of Bell Laboratories as being the original proposer of using 
adaptation for echo cancellation purposes (see Kelly and Logan (1970)). 

In Computer Project IV.1 we study an adaptive echo canceller implementation. In that project, an 
adaptive filter is used to estimate the echo path and to subsequently generate a replica of the echo in 
order to cancel it. It should be mentioned that, in practice, a complete echo canceller implementation 
would need to perform additional tasks, besides echo cancellation and adaptation, in order to avoid 
distorting the speech signals. Among these tasks, we may mention the need to identify signaling 
tones in order to avoid cancelling them, as well as the need to identify double-talk conditions (i.e., 
situations when both speakers are simultaneously active) in order to freeze adaptation and avoid filter 
divergence. In addition, the design should account for the effect of finite-precision computations on 
the performance of the echo canceller. Later, in Computer Project VI.1 we shall study acoustic echo 
cancellation, as opposed to line echo cancellation. 


Nonstationary Environments 


Energy conservation. The energy-conservation relation (20.24) is the extension to nonstationary environments of relation (15.32), which was originally derived by Sayed and Rupp (1995) in their studies on the robustness and small gain analysis of adaptive filters (see Chapter 44). The extension (20.24), along with its variance relation (20.32), were derived by Yousef and Sayed (1999a,2001,2002) and used therein, as well as in subsequent works by the same authors, to study the tracking performance of adaptive filters in a unified manner. The presentation in this chapter follows mainly Yousef and Sayed (2000a,2001,2002). In the work by Shin and Sayed (2004), the extension of the tracking analysis to affine projection algorithms is presented.


EMSE of adaptive filters. In Chapter 21 we evaluated the tracking EMSE for several particular choices of $g[\cdot]$ in (20.21). In Prob. 7.4 of Sayed (2003) we derive a general expression for the EMSE of adaptive filters corresponding to generic choices of $g[\cdot]$ (with $g[\cdot]$ being solely a function of $e(\cdot)$). We do so by appealing to a Gaussian assumption on the distribution of $e_a(i)$, and by using this assumption to rewrite the variance relation in an alternative form. It is shown in the aforementioned problem that if we introduce the functions

$$h_U \triangleq E\,|g[e(i)]|^2 \qquad \text{and} \qquad h_G \triangleq \frac{\mathrm{Re}\left(E\,e_a^*(i)\,g[e(i)]\right)}{E\,|e_a(i)|^2}$$

then $\{h_U, h_G\}$ are solely functions of $E|e_a(i)|^2$ and, moreover, the variance relation leads to the following equation in terms of the desired filter EMSE,

$$\zeta = \frac{\mu\,\mathrm{Tr}(R_u)\,h_U(\zeta) + \mu^{-1}\,\mathrm{Tr}(Q)}{2\,h_G(\zeta)}$$

This result indicates that the EMSE can be obtained as the fixed point of a nonlinear equation in $\zeta$. In other words, given $g[\cdot]$, we can evaluate the functions $h_U$ and $h_G$ under the assumed conditions, and then proceed to solve the above nonlinear equation for $\zeta$. Problems 7.5-7.8 of Sayed (2003) illustrate this method of computation and show that its results are consistent with the results obtained in this chapter.
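As an illustration of the tracking fixed-point equation, the sketch below treats sign-error LMS with real data, $g[e] = \mathrm{sign}(e)$. It assumes that $e(i) = e_a(i) + v(i)$ is zero-mean Gaussian with variance $\zeta + \sigma_v^2$, in which case $h_U = 1$ and, by Price's theorem, $h_G(\zeta) = \sqrt{2/\pi}/\sqrt{\zeta + \sigma_v^2}$. The equation then reads $\zeta = \alpha\sqrt{\zeta + \sigma_v^2}$ with $\alpha = \sqrt{\pi/8}\,[\mu\,\mathrm{Tr}(R_u) + \mu^{-1}\mathrm{Tr}(Q)]$, whose positive root is $\frac{\alpha}{2}(\alpha + \sqrt{\alpha^2 + 4\sigma_v^2})$. Function names and numerical values are hypothetical.

```python
import math


def solve_tracking_emse(h_U, h_G, mu, tr_Ru, tr_Q, zeta0=0.0, n_iter=200):
    """Iterate zeta <- [mu Tr(R_u) h_U(zeta) + Tr(Q) / mu] / (2 h_G(zeta))."""
    zeta = zeta0
    for _ in range(n_iter):
        zeta = (mu * tr_Ru * h_U(zeta) + tr_Q / mu) / (2.0 * h_G(zeta))
    return zeta


def sign_error_lms_closed_form(mu, tr_Ru, tr_Q, sigma_v2):
    """Positive root of zeta^2 = alpha^2 (zeta + sigma_v^2)."""
    alpha = math.sqrt(math.pi / 8.0) * (mu * tr_Ru + tr_Q / mu)
    return 0.5 * alpha * (alpha + math.sqrt(alpha ** 2 + 4.0 * sigma_v2))


if __name__ == "__main__":
    mu, tr_Ru, tr_Q, sigma_v2 = 1e-3, 5.0, 1e-6, 1e-2   # hypothetical values
    zeta = solve_tracking_emse(
        lambda z: 1.0,                                           # h_U for sign(e)
        lambda z: math.sqrt(2.0 / math.pi) / math.sqrt(z + sigma_v2),
        mu, tr_Ru, tr_Q)
    print(zeta, sign_error_lms_closed_form(mu, tr_Ru, tr_Q, sigma_v2))
```

The map $\zeta \mapsto \alpha\sqrt{\zeta + \sigma_v^2}$ is increasing and concave, so the plain iteration started from zero converges monotonically to the positive root.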


Tracking results. Among the earliest works on the tracking performance of adaptive filters are 
those of Widrow et al. (1976) and Bershad et al. (1980). The former uses a random-walk model 
for the variations in the optimal weight vector, while the latter assumes deterministic variations 
in the optimal weight vector. Both works focused on the transient performance of LMS but were 
not concerned with its steady-state performance. Steady-state results appeared subsequently in Far- 
den (1981), Benveniste and Ruget (1982), Walach and Widrow (1984), Eweda and Macchi (1985), 
Eleftheriou and Falconer (1986), Marcos and Macchi (1987), and Benveniste (1987). The work by 
Benveniste and Ruget (1982) was apparently the first to compare the tracking performance of different
adaptive algorithms. More recent analyses for other classes of adaptive filters appear in Eweda
(1990b,1994,1997), Hajivandi and Gardner (1990), Cho and Mathews (1990), Guo (1994), Bahai 
and Sarraf (1997), and Rupp (1998b). The tracking results in Lemma 21.3 for sign-error LMS agree 
with those derived by Cho and Mathews (1990) and Eweda (1990b). The tracking results for LMMN 
are from Yousef and Sayed (19990,2001). The comparison results of the tracking performance of 
RLS and LMS agree with those derived by Eweda (1994). 


Random-walk model. It is customary in the literature to use the random-walk model (20.10); see, e.g., Haykin (2000, p. 644) and Macchi (1995). However, as explained in Sec. 20.2, this model is not necessarily meaningful since the covariance matrix of $w_i^o$ grows unbounded. In Probs. IV.29-IV.33, we show how to study the tracking performance of adaptive filters by relying instead on model (20.14). In the computer project at the end of the chapter, we provide an example that shows how models of the form (20.14) arise in applications.
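The unbounded growth mentioned above is easy to see from the variance recursions. The sketch below works with a single weight entry and compares a random walk, $w_i^o = w_{i-1}^o + q_i$, against a first-order autoregressive perturbation, $\theta_i = a\,\theta_{i-1} + q_i$ with $|a| < 1$. Treating the alternative model as this autoregressive form is an assumption based on the summary of Probs. IV.29-IV.33; the function names and values are hypothetical.

```python
def random_walk_variance(q_var, n):
    """Variance of w_n^o for a random walk started from a fixed w_{-1}^o."""
    p = 0.0
    for _ in range(n + 1):
        p = p + q_var                 # each step adds the variance of q_i
    return p


def ar1_variance(q_var, a, n):
    """Variance of theta_n for theta_i = a theta_{i-1} + q_i, theta_{-1} = 0."""
    p = 0.0
    for _ in range(n + 1):
        p = abs(a) ** 2 * p + q_var   # contracts toward q_var / (1 - |a|^2)
    return p


if __name__ == "__main__":
    q_var = 1e-4
    for n in (10, 100, 1000, 10000):
        print(n, random_walk_variance(q_var, n), ar1_variance(q_var, 0.9, n))
```

The random-walk variance grows linearly with time, while the autoregressive variance settles at $\sigma_q^2/(1 - |a|^2)$, which is what makes the second model better behaved for long observation windows.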


Random and cyclic nonstationarities. Besides random channel variations, another source of nonstationarity that is common in communication systems is due to mismatches between the transmitter and receiver carrier generators (or clocks). Such mismatches result in periodic system variations, which can be damaging to the performance of adaptive filters, even for small carrier frequency offsets. Examples to this effect appear in Bahai and Sarraf (1997) and Rupp (1998b). The ability of adaptive filtering algorithms to track such periodic system variations has received little attention in the literature, except perhaps for the works by Hajivandi and Gardner (1990), Rupp (1998b), and Yousef and Sayed (2002).

The work by Rupp (1998b) uses a first-order approximation to examine the performance of LMS 
in the presence of carrier frequency offsets only. In Probs. IV.29-IV.33, we follow the work of Yousef 
and Sayed (2002) and show how the energy-conservation approach used in the body of the chapter 
can be applied to study the performance of a variety of adaptive filters in the joint cases of random 
and cyclic nonstationarities. 


Tracking Rayleigh fading channels. In Computer Project IV.2 we illustrate the tracking abil- 
ity of LMS in the context of a Rayleigh fading and multipath channel. Such channel models are 
widely used in modeling wireless communications environments (see, e.g., Viterbi (1995), Rappa- 
port (1996), and Verdu (1998)). In the project we describe some of the basic concepts that are used 
in characterizing Rayleigh channels. The pioneering work on the characterization of fading and mul- 
tipath conditions is due to Price (1954,1956). Other early contributions include Price and Green 
(1958) and Kailath (1960b, 1961). 


Finite precision effects. The performance of adaptive filters is affected adversely when they 
are implemented in finite-precision arithmetic due to roundoff errors. For example, quantization 
errors may affect the stability of an adaptive filter and ultimately lead to its divergence. They can 
also degrade the steady-state performance of the filter causing it to attain a higher mean-square error 
than what is expected from an infinite-precision analysis. The performance degradation tends to be 
more serious for recursive-least-squares (RLS) algorithms as opposed to least-mean-squares (LMS) 
algorithms. 

Quantization errors propagate in a highly nontrivial manner, and studying their effect on adaptive 
filter performance requires several assumptions on how roundoff errors arise. Chapter 8 of Sayed 
(2003) describes a procedure for evaluating the effect of quantization errors by relying on the same 
energy conservation arguments that we have used so far in our exposition. The analysis there shows 
that the effect of roundoff errors on filter performance can be analyzed in a manner similar to how 
we studied the effect of channel nonstationarities on tracking performance in the current chapter. 
Specifically, the main conclusion from Sayed (2003, Ch. 8) is that, for sufficiently small step-sizes, 
the approximate EMSE of an adaptive filter in a quantized environment can be obtained from its 
EMSE in a nonstationary environment by substituting $\{Q, R_u, \sigma_v^2\}$ by:


QeQ-2e21u, Ru=Rutoriu, oi xo? + 3 


where p p 

K K 
It is assumed that the entries of the weight vector, $w_i$, are quantized to $B_c$ bits (assumed large
enough) with saturation level $L_c$ and the entries of $\{d(i), u_i, u_iw_{i-1}, e(i)\}$ are quantized to $B_r$
bits with saturation level $L_r$. Moreover, x = 1 for real data and x = 2 for complex data. For
example, the EMSE for a finite precision implementation of LMS would be given by


QN I pos Tr(Ra) + u^ Tr(Q) 
2 2 — uTr(R,) 


Finite precision results. There have been extensive studies in the literature on the effect of 
quantization errors on adaptive filter performance. One of the earliest studies of finite-precision 
effects on LMS performance was performed by Gitlin, Mazo, and Taylor (1973) followed by Weiss 
and Mitra (1979). Afterwards, Caraiscos and Liu (1984) examined the effect of finite word-length 
conditions on the steady-state filter performance assuming floating-point arithmetic, while Alexander 
(1987) examined the effect of finite word-length on the filter transient performance. A discussion
of limited precision effects on filter performance also appears in Cioffi (1987) and Sherwood and
Bershad (1987).

Further results for LMS, NLMS, and sign-regressor LMS appear in Chang and Willson (1995), 
Bermudez and Bershad (1996ab), Eweda, Yousef, and El-Ramly (1998), and Eweda, Younis, and 
El-Ramly (1998). In the works by Bermudez and Bershad (1996ab), the authors develop a model 
to account for nonlinearities in the quantization process, including the occurrence of underflow, and 
use the model to study the performance of quantized LMS. The extension of the energy conser- 
vation approach to the case of finite precision implementations can be found in Yousef and Sayed 
(2000b,2003) and Sayed (2003, Ch. 8). 


Drift problem. The fact that the LMS filter can produce unbounded weight estimates in some 
situations is illustrated in Prob. IV.39. This so-called drift problem has been described in several 
references including, for example, Gitlin, Meadors, and Weinstein (1982), Ioannou and Kokotovic 
(1984), Cioffi and Werner (1985), Sethares et al. (1986), Cioffi (1987), and Rupp (1995). The ref- 
erence Ioannou and Kokotovic (1984) provides an analysis in the adaptive control context. The 
reference Sethares et al. (1986) studies the drift problem in a deterministic infinite-precision setting, 
while the references Cioffi and Werner (1985) and Cioffi (1987) consider finite-precision effects. 
They show that even with zero noise, unbounded growth (i.e., drift) of the weight estimates can hap- 
pen due to finite-precision arithmetic errors. Such unbounded growth of the LMS estimates usually 
happens if two conditions are satisfied: 


1. The regressor covariance matrix, Ru, is singular. 


2. The noise or the finite-precision arithmetic errors have nonzero mean (zero mean variables 
can become non-zero mean due to finite precision errors — see Prob. IV.38.) 


One of the earliest propositions to deal with the lack of sufficient excitation (i.e., with a singular 
Ru) was by Zahm (1973), in which it was suggested to add a small amount of white noise to the 
regression data (a procedure known as dithering). However, the leakage-based solution is nowadays 
the preferred way to go. 
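
The two drift conditions can be illustrated with a small simulation. The sketch below is a toy example under stated assumptions and is not taken from the cited references: a two-tap filter whose regressor entries are identical (so that Ru is singular, with null direction [1, -1]), and a constant perturbation `bias` added along that null direction to stand in for a nonzero-mean arithmetic error. The names `run`, `bias`, and `leak` are hypothetical.

```python
import random

def run(num_iter, mu=0.05, leak=0.0, bias=1e-4, seed=0):
    """Two-tap LMS (leak=0) or leaky LMS (leak>0) with a singular Ru.

    Returns the component of the final weight vector along [1, -1]/sqrt(2),
    which is the direction that the regressors never excite.
    """
    rng = random.Random(seed)
    w_true = [0.5, 0.5]            # hypothetical model: d(i) = u_i w_true
    w = [0.0, 0.0]
    for _ in range(num_iter):
        x = rng.gauss(0.0, 1.0)
        u = [x, x]                 # identical entries, so Ru is singular
        d = u[0] * w_true[0] + u[1] * w_true[1]    # zero measurement noise
        e = d - (u[0] * w[0] + u[1] * w[1])
        # Constant perturbation along [1, -1]: an assumed model of a
        # nonzero-mean finite-precision error, not a real quantizer.
        w = [(1.0 - mu * leak) * w[0] + mu * u[0] * e + bias,
             (1.0 - mu * leak) * w[1] + mu * u[1] * e - bias]
    return (w[0] - w[1]) / 2 ** 0.5

drift_lms = run(20000)               # grows linearly with the iteration count
drift_leaky = run(20000, leak=0.1)   # settles near sqrt(2)*bias/(mu*leak)
print(drift_lms, drift_leaky)
```

In this toy setting the output error only sees the sum of the two taps, so the growth along [1, -1] is invisible at the filter output. With leakage the same component remains bounded, at the price of a bias in the estimate.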


Circular-leaky LMS. As explained in Prob. IV.39, leaky-LMS helps ameliorate the drift problem 
of LMS. However, this solution comes at the expense of biased weight estimates. In Nascimento and 
Sayed (1999), a leaky variant (called circular-leaky LMS) is proposed that avoids the drift problem 
and guarantees unbiased estimates. There are two modifications with respect to leaky LMS in the 
circular-leaky variant. First, leakage is applied to a single tap at each iteration and, second, leakage 
is applied only if the tap magnitude exceeds a pre-specified level. 
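
The two modifications can be sketched in code. The function below is only an illustration of the description above (one tap visited per iteration, leakage applied only above a threshold), written for real data; it is not the exact recursion of Nascimento and Sayed (1999). The function name and the parameters `alpha` (leakage factor) and `level` (threshold) are assumptions of this sketch.

```python
def circular_leaky_step(w, u, d, i, mu=0.01, alpha=0.1, level=1.0):
    """One iteration of a circular-leaky-style LMS update (illustrative only).

    w: current weights, u: regressor entries, d: reference sample,
    i: time index used to select the single tap that may be leaked.
    """
    e = d - sum(uk * wk for uk, wk in zip(u, w))
    new_w = [wk + mu * uk * e for uk, wk in zip(u, w)]   # plain LMS step
    k = i % len(w)                 # taps are visited one at a time, circularly
    if abs(w[k]) > level:          # leak only when that tap is large
        new_w[k] -= mu * alpha * w[k]
    return new_w

# With a zero regressor the LMS correction vanishes and only leakage acts.
print(circular_leaky_step([2.0, 0.5], [0.0, 0.0], 0.0, i=0))
print(circular_leaky_step([2.0, 0.5], [0.0, 0.0], 0.0, i=1))
```

In the first call tap 0 exceeds the threshold and is shrunk slightly; in the second call tap 1 is visited, lies below the threshold, and is left unchanged.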

Table IV.1 summarizes the properties of LMS, leaky-LMS, and circular-leaky LMS for comparison
purposes. In the complexity column, we list approximate values for the number of multiplications,
additions, multiply-and-accumulate (MA), and if-then (IF) commands necessary for each
algorithm, assuming real data.


TABLE IV.1 Comparison of three LMS filters.

                        Drift      Biased          Complexity
Algorithm               problem    when Ru > 0     MA     ×     +     IF
circular-leaky LMS      NO         NO              2M     3     2     3


PROBLEMS


Problem IV.1 (Auto-regressive process) A unit-variance white-random process s(i) is fed into
a first-order auto-regressive model with transfer function $\sqrt{1-a^2}/(1 - az^{-1})$, where a is real.
The output process is denoted by u(i); it is referred to as an auto-regressive process of order 1,
written as AR(1). Assume |a| < 1. Show that the auto-correlation sequence of u(i) is given by
$r(k) = \mathsf{E}\,u(i)u(i-k) = a^{|k|}$, for all integer values k. If u(i) is fed into an adaptive filter of order
M, what is the covariance matrix of the resulting regressor $u_i$?
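
As a numerical sanity check of the claimed auto-correlation (not a proof), the sketch below uses the impulse response of the model, h(n) = sqrt(1 - a^2) a^n for n >= 0, and evaluates r(k) as the truncated sum of h(n)h(n+|k|). The function names and the truncation length `num_terms` are assumptions of this sketch.

```python
def ar1_autocorrelation(a, k, num_terms=2000):
    """Truncated sum r(k) = sum_n h(n) h(n+|k|), with h(n) = sqrt(1-a^2) a^n."""
    k = abs(k)
    scale = 1.0 - a * a            # h(n) h(n+k) = (1 - a^2) a^(2n+k)
    return sum(scale * a ** (2 * n + k) for n in range(num_terms))

def toeplitz_from_autocorrelation(a, M):
    """M x M Toeplitz matrix whose (m, n) entry is a^|m-n|."""
    return [[a ** abs(m - n) for n in range(M)] for m in range(M)]

a = 0.8
for k in range(-3, 4):
    print(k, ar1_autocorrelation(a, k), a ** abs(k))
print(toeplitz_from_autocorrelation(0.5, 3))
```

The two printed columns should agree to within rounding, and the Toeplitz matrix shows how the values r(k) would be arranged for a regressor with shift structure.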


Problem IV.2 (Finite alphabets) Consider an adaptive filter with a regressor vector $u_i$ that
possesses shift-structure, namely, $u_i = [\,u(i)\ \ u(i-1)\ \ \ldots\ \ u(i-M+1)\,]$. The entries of
$u_i$ are realizations of a binary random variable, i.e., they are ±1 with probability 1/2. Assume
initially that all variables are real-valued.


(a) Show that the EMSE that would result when the filter is trained using LMS is given by
$\zeta^{\rm LMS} = \mu M\sigma_v^2/(2 - \mu M)$. Show that this result is exact, i.e., it holds irrespective of any
approximations.


(b) Assuming $\epsilon \ll M$, show that a similar conclusion holds when the filter is trained using
ε-NLMS with $\zeta^{\epsilon\text{-NLMS}} = \mu\sigma_v^2/(2 - \mu)$.


(c) Likewise, assuming $\epsilon \ll 1$, show that for power-normalized ε-NLMS it holds that
$\zeta^{\epsilon\text{-PNLMS}} = \mu M\sigma_v^2/(2 - \mu M)$. Remark. Recall from the discussion prior to (11.9) that the step-size for ε-NLMS
with power normalization is in general M times smaller than the step-size for ε-NLMS and, therefore,
the above expression for the EMSE agrees with that of part (b) if μ is replaced by μ/M.


(d) Show that the EMSE of sign-error LMS would be the same as in Lemma 18.1 with $\mathrm{Tr}(R_u)$
replaced by M, and that this result holds under the same assumptions stated in the lemma.


(e) Show that the EMSE of LMF would be $\zeta^{\rm LMF} = \mu M\xi_v^6/(6\sigma_v^2 - 15\mu M\xi_v^4)$. Under what
approximation does this result hold?


(f) How would the results change if the u(i) arise from a QPSK constellation instead? 


Remark. The separation principle (16.7) is also exact in the case of regressors with modulated inputs, e.g., as in 
Pfann and Steward (1998). 
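
The LMS expression of part (a), EMSE = μMσ_v²/(2 − μM), can be compared against a simulation. The sketch below is one possible check under stated assumptions: real data, a randomly drawn model `w_o`, Gaussian noise, and a time average of the squared a priori error after a burn-in period. The function name, the parameter values, and the burn-in length are choices made for this illustration only.

```python
import random

def lms_emse_binary(M=4, mu=0.05, sigma_v2=0.01, num_iter=100000,
                    burn_in=5000, seed=1):
    """Time-averaged squared a priori error of LMS with +/-1 shift-structured regressors."""
    rng = random.Random(seed)
    w_o = [rng.uniform(-1.0, 1.0) for _ in range(M)]   # hypothetical unknown model
    w = [0.0] * M
    u = [rng.choice((-1.0, 1.0)) for _ in range(M)]
    sigma_v = sigma_v2 ** 0.5
    acc, count = 0.0, 0
    for i in range(num_iter):
        u = [rng.choice((-1.0, 1.0))] + u[:-1]         # shift structure
        e_a = sum(uk * (wo - wk) for uk, wo, wk in zip(u, w_o, w))
        e = e_a + rng.gauss(0.0, sigma_v)              # e(i) = e_a(i) + v(i)
        w = [wk + mu * uk * e for uk, wk in zip(u, w)] # LMS update
        if i >= burn_in:
            acc += e_a * e_a
            count += 1
    return acc / count

sim = lms_emse_binary()
theory = 0.05 * 4 * 0.01 / (2 - 0.05 * 4)
print(sim, theory)
```

Because the squared norm of a ±1 regressor is always M, the simulated and theoretical values should agree up to the statistical fluctuation of the time average.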


Problem IV.3 (Second-order filter) Consider an adaptive filter with a 2-dimensional regression
vector $u_i = [\,u(i)\ \ u(i-1)\,]$. The entries {u(j)} are independent random variables satisfying

$$u(j) = \begin{cases} a & \text{with probability } p \\ b & \text{with probability } q \end{cases}$$

where a and b are real numbers with a > b.

(a) Determine the values of p and q, and conditions on a and b, such that $u_i$ is zero mean.

(b) Find the covariance matrix of $u_i$.

(c) Find an expression for the EMSE that would result when the filter is trained using LMS.

(d) Repeat part (c) when the filter is trained using ε-NLMS.


Problem IV.4 (Combination of adaptive filters) Consider a combination of two adaptive filters
as shown in Fig. IV.1. One filter has a 2 × 1 weight vector $w_{1,i}$ while the other filter has a
3 × 1 weight vector $w_{2,i}$; both at the same time instant i. The regression sequence for the top filter
is denoted by $\{u_{1,i}\}$ and the regression sequence for the bottom filter is denoted by $\{u_{2,i}\}$. Each
regression sequence is i.i.d. and independent of the other. The individual entries of $u_{1,i}$ are chosen
independently of each other and lie on the unit circle. Likewise for $u_{2,i}$ except that its entries lie on
the circle of radius $\sqrt{2}$.


FIGURE IV.1 A combination of two adaptive filters.


Both adaptive filters employ the same reference sequence d(i) to generate their respective output
errors:

$$e_1(i) = d(i) - u_{1,i}w_{1,i-1}, \qquad e_2(i) = d(i) - u_{2,i}w_{2,i-1}$$

and the weight vectors are updated according to the following rules:

$$w_{1,i} = w_{1,i-1} + \mu_1\frac{u_{1,i}^*}{\|u_{1,i}\|^2}\,e_1(i), \qquad w_{2,i} = w_{2,i-1} + \mu_2 u_{2,i}^*\,e_2(i)$$

Let $\{y_1(i), y_2(i)\}$ denote the outputs of the adaptive filters,

$$y_1(i) = u_{1,i}w_{1,i-1}, \qquad y_2(i) = u_{2,i}w_{2,i-1}$$

These outputs are combined by means of a nonnegative scalar $0 \le \lambda \le 1$ in a convex manner to
generate the output of the combined adaptive structure as follows:

$$y(i) = \lambda y_1(i) + (1-\lambda)y_2(i)$$

Assume that the data $\{d(i), u_{1,i}, u_{2,i}\}$ satisfy the following conditions:


1. There exists a vector $w_1^o$ such that $d(i) = u_{1,i}w_1^o + v_1(i)$.

2. There exists a vector $w_2^o$ such that $d(i) = u_{2,i}w_2^o + v_2(i)$.

3. The noise sequences $\{v_1(i), v_2(i)\}$ are i.i.d. with variance $\sigma_v^2$.

4. The noise sequences $\{v_1(i), v_2(i)\}$ are independent of each other and of $\{u_{1,j}, u_{2,j}\}$ for all i, j.

5. The initial conditions $\{w_{1,-1}, w_{2,-1}\}$ are independent of all $\{d(j), u_{1,j}, u_{2,j}, v_1(j), v_2(j)\}$.

6. All random variables have zero means.


(a) Find exact expressions for the EMSEs of the two adaptive filters, $w_{1,i}$ and $w_{2,i}$.

(b) Find an expression for the EMSE of the combined adaptive structure in terms of the EMSEs
of the individual filters and λ. Determine an optimal value for λ in order to minimize the
EMSE of the combined structure. Compare the optimal EMSE to the individual EMSEs.

Remark. For results on a more general convex combination of adaptive filters, see the work by Arenas-Garcia,
Figueiras-Vidal, and Sayed (2006).


Problem IV.5 (EMSE of ε-NLMS with power normalization) The purpose of this problem is to
extend the derivation of Sec. 17.1 to ε-NLMS with power normalization, as described by (19.22)-
(19.23).


(a) Repeat the arguments that led to (17.6) to conclude that the EMSE of the above algorithm is
still given by (17.6) where now $\alpha_u = \mathsf{E}\left(\|u_i\|^2/(\epsilon + p(i))^2\right)$ and $\eta_u = \mathsf{E}\left(1/(\epsilon + p(i))\right)$.


(b) Assume ε is small and that the regressor is circular Gaussian with covariance matrix $R_u = \sigma_u^2 I$.
Show that

$$\mathsf{E}\,\|u_i\|^2 = M\sigma_u^2, \qquad \mathsf{E}\,p(i) = \sigma_u^2(1 - \beta^{i+1}), \qquad \mathsf{E}\,p^2(i) = \frac{1-\beta}{1+\beta}\,\gamma\sigma_u^4\,(1 - \beta^{2(i+1)})$$

where γ = 3 for real data and γ = 2 for complex data. Hint. Recall that the finite sum $a + ar +
\ldots + ar^i$, with initial term a and ratio r, evaluates to $a(1 - r^{i+1})/(1 - r)$. When $i \to \infty$, the sum
becomes a geometric series and if |r| < 1, it converges to a/(1 − r). Recall also that if x is a Gaussian
random variable, then $\mathsf{E}\,|x|^4 = 2\sigma_x^4$ when x is complex-valued and circular and $\mathsf{E}\,x^4 = 3\sigma_x^4$ when x is
real-valued.


(c) Under the conditions in part (b), use the approximations $\alpha_u \approx \mathsf{E}\,\|u_i\|^2/\mathsf{E}\,p^2(i)$ and
$\eta_u \approx 1/\mathsf{E}\,p(i)$ as $i \to \infty$, to justify the expression:

$$\zeta^{\epsilon\text{-PNLMS}} \approx \frac{\mu\sigma_v^2 M(1+\beta)}{2\gamma(1-\beta) - \mu M(1+\beta)} \qquad \text{(Gaussian regressors)}$$

for small step-sizes.

Remark. See Sayed (2003, Sec. 6.6.3) for further details.


Problem IV.6 (EMSE of LMF and LMMN) Consider the LMMN algorithm (19.24). The least-mean
fourth (LMF) algorithm corresponds to the special case δ = 0. Clearly,

$$g[e] = \delta[e_a + v] + (1-\delta)[e_a + v]\left[\,|e_a|^2 + |v|^2 + e_a^*v + e_av^*\,\right]$$

where we are omitting the time index i for compactness of notation. Introduce the symbols $\bar\delta = 1 - \delta$,
$\xi_v^4 = \mathsf{E}|v(i)|^4$, and $\xi_v^6 = \mathsf{E}|v(i)|^6$, where the scalars $\{\xi_v^4, \xi_v^6\}$ denote the fourth and sixth-order
moments of v(i), respectively. The data $\{d(i), u_i, v(i)\}$ are assumed to satisfy model (15.16).


(a) Assume first that all data are real-valued. Using the fact that $e_a(i)$ and v(i) are independent
(cf. Lemma 15.1), and ignoring third and higher-order powers of $e_a(i)$, justify the expressions

$$\mathsf{E}\,e_a(i)g[e(i)] = b\,\mathsf{E}\,e_a^2(i)$$

$$\mathsf{E}\,\|u_i\|^2g^2[e(i)] = a\,\mathrm{Tr}(R_u) + c\,\mathsf{E}\,\|u_i\|^2e_a^2(i) + 8\delta\bar\delta\left(\mathsf{E}\,\|u_i\|^2e_a(i)\right)\mathsf{E}\,v^3(i) + 6\bar\delta^2\left(\mathsf{E}\,\|u_i\|^2e_a(i)\right)\mathsf{E}\,v^5(i)$$

where the constants {a, b, c} are defined by

$$a \triangleq \delta^2\sigma_v^2 + 2\delta\bar\delta\xi_v^4 + \bar\delta^2\xi_v^6, \qquad b \triangleq \delta + 3\bar\delta\sigma_v^2, \qquad c \triangleq \delta^2 + 12\delta\bar\delta\sigma_v^2 + 15\bar\delta^2\xi_v^4$$

Conclude that the variance relation (15.40) leads to

$$2b\,\mathsf{E}\,e_a^2(i) = \mu a\,\mathrm{Tr}(R_u) + \mu c\,\mathsf{E}\,\|u_i\|^2e_a^2(i) + 8\mu\delta\bar\delta\left(\mathsf{E}\,\|u_i\|^2e_a(i)\right)\mathsf{E}\,v^3(i) + 6\mu\bar\delta^2\left(\mathsf{E}\,\|u_i\|^2e_a(i)\right)\mathsf{E}\,v^5(i), \qquad i \to \infty$$

In order to simplify this result, we may consider three cases (as in our study of the mean-square
performance of LMS in Chapter 16).

(a.1) Sufficiently small step-sizes. Argue that when μ is sufficiently small we get

$$\zeta^{\rm LMMN} = \frac{\mu a}{2b}\,\mathrm{Tr}(R_u), \qquad \zeta^{\rm LMF} = \frac{\mu\xi_v^6}{6\sigma_v^2}\,\mathrm{Tr}(R_u)$$

(a.2) Separation principle. For larger values of μ, use the separation assumption (16.7) and
the steady-state assumption (16.12) to conclude that

$$\zeta^{\rm LMMN} = \frac{\mu a\,\mathrm{Tr}(R_u)}{2b - \mu c\,\mathrm{Tr}(R_u)}, \qquad \zeta^{\rm LMF} = \frac{\mu\xi_v^6\,\mathrm{Tr}(R_u)}{6\sigma_v^2 - 15\mu\xi_v^4\,\mathrm{Tr}(R_u)}$$

(a.3) White Gaussian regressors. Assume that the regression vector $u_i$ is Gaussian with
covariance matrix $R_u = \sigma_u^2I$, and ignore the terms in $v^3(i)$ and $v^5(i)$. Repeat the
argument of Sec. 16 to conclude that

$$\zeta^{\rm LMMN} = \frac{\mu Ma\sigma_u^2}{2b - \mu c(M+2)\sigma_u^2}, \qquad \zeta^{\rm LMF} = \frac{\mu M\sigma_u^2\xi_v^6}{6\sigma_v^2 - 15\mu(M+2)\sigma_u^2\xi_v^4}$$


(b) Assume now that the data are complex-valued. Show that the same results of part (a) still hold
with the modifications $b = \delta + 2\bar\delta\sigma_v^2$ and $c = \delta^2 + 8\delta\bar\delta\sigma_v^2 + 9\bar\delta^2\xi_v^4$.


Remark. The argument in this problem follows the approach of Yousef and Sayed (2001); see also Sayed (2003,
Sec. 6.8) for further details.


Problem IV.7 (EMSE of ε-APA) The purpose of this problem is to extend the energy-conservation
approach of Sec. 15.3 to evaluate the EMSE of the ε-APA algorithm (19.25)-(19.27).


(a) Introduce the a priori and a posteriori error vectors $e_{a,i} = U_i\tilde w_{i-1}$ and $e_{p,i} = U_i\tilde w_i$,
where $\tilde w_i = w^o - w_i$. Verify that $e_{p,i} = e_{a,i} - \mu U_iU_i^*(\epsilon I + U_iU_i^*)^{-1}e_i$ and establish
the relation

$$\|\tilde w_i\|^2 + e_{a,i}^*(U_iU_i^*)^{-1}e_{a,i} = \|\tilde w_{i-1}\|^2 + e_{p,i}^*(U_iU_i^*)^{-1}e_{p,i}$$

This equality extends the energy-conservation relation (15.32) to the ε-APA case.


(b) Introduce the matrix quantities $A_i = (\epsilon I + U_iU_i^*)^{-1}U_iU_i^*(\epsilon I + U_iU_i^*)^{-1}$ and
$B_i = (\epsilon I + U_iU_i^*)^{-1}$. Show that, in steady-state and under expectation, the energy-conservation
relation of part (a) reduces to

$$\mu\,\mathsf{E}\,[e_i^*A_ie_i] = \mathsf{E}\,[e_{a,i}^*B_ie_i] + \mathsf{E}\,[e_i^*B_ie_{a,i}], \qquad \text{as } i \to \infty$$

Ignore the dependency of $\tilde w_{i-1}$ on prior noises in steady-state. This variance relation is the
extension of (15.40) to the ε-APA case.


(c) Consider the following assumptions:

(c.1) The data $\{d(i), u_i\}$ satisfy conditions (15.16).

(c.2) The dependency of $\tilde w_{i-1}$ on prior noises is negligible in steady-state.

(c.3) Separation principle. At steady-state, $U_i$ is independent of $e_{a,i}$ and
$\mathsf{E}\,e_{a,i}e_{a,i}^* = (\mathsf{E}\,|e_a(i)|^2)\cdot S$, where $S \approx I$ for small μ and $S \approx b_0b_0^{\mathsf T}$ for larger μ. Here $b_0$ denotes
the first basis vector, $b_0 = \mathrm{col}\{1, 0, \ldots, 0\}$.

Argue that $\zeta^{\epsilon\text{-APA}} = \mu\sigma_v^2\,\mathrm{Tr}(\mathsf{E}\,A_i)/(2\eta_u - \mu\alpha_u)$, where $\alpha_u = \mathrm{Tr}(S\cdot\mathsf{E}[A_i])$ and
$\eta_u = \mathrm{Tr}(S\cdot\mathsf{E}[B_i])$.


(d) Two simplifications are possible when the regularization parameter ε is sufficiently small.

(d.1) Using $A_i \approx (U_iU_i^*)^{-1}$ and $S \approx I$, verify that $\zeta^{\epsilon\text{-APA}} \approx \mu\sigma_v^2/(2 - \mu)$. On the other
hand, using $S \approx b_0b_0^{\mathsf T}$, verify that

$$\zeta^{\epsilon\text{-APA}} \approx \frac{\mu\sigma_v^2\,\mathrm{Tr}(\mathsf{E}\,A_i)}{(2-\mu)\,\mathsf{E}\,A_i(0,0)}$$

where $A_i(0,0)$ denotes the (0,0) entry of $A_i$.

(d.2) Use the approximations $\mathrm{Tr}(\mathsf{E}\,A_i) \approx \mathsf{E}\,(K/\|u_i\|^2)$ and $\mathrm{Tr}(S\,\mathsf{E}[A_i]) \approx 1/\mathrm{Tr}(R_u)$ to
justify

$$\zeta^{\epsilon\text{-APA}} \approx \frac{\mu\sigma_v^2}{2-\mu}\,\mathrm{Tr}(R_u)\,\mathsf{E}\left(\frac{K}{\|u_i\|^2}\right)$$


Remark. The separation condition in part (c.3) is motivated in Prob. IV.8. The derivation in this problem follows
Shin and Sayed (2004) and also Sayed (2003, Sec. 6.10).


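Problem IV.8 (APA condition) The purpose of this problem is to motivate the second approximation
in condition (c.3) from Prob. IV.7, namely, that at steady-state $\mathsf{E}\,e_{a,i}e_{a,i}^* \approx (\mathsf{E}\,|e_a(i)|^2)\cdot S$,
where $S \approx I$ at small μ and $S \approx b_0b_0^{\mathsf T}$ at larger μ, with $b_0 = \mathrm{col}\{1, 0, \ldots, 0\}$.

Consider the a priori and a posteriori error vectors

$$e_{a,i} = \begin{bmatrix} u_i\tilde w_{i-1} \\ u_{i-1}\tilde w_{i-1} \\ \vdots \\ u_{i-K+1}\tilde w_{i-1} \end{bmatrix}, \qquad e_{p,i} = \begin{bmatrix} u_i\tilde w_i \\ u_{i-1}\tilde w_i \\ \vdots \\ u_{i-K+1}\tilde w_i \end{bmatrix}$$

(a) Verify that for small ε, $e_{p,i} = e_{a,i} - \mu e_i = (1-\mu)e_{a,i} - \mu v_i$.

(b) Conclude that the variances of the second and third entries of $e_{a,i}$ satisfy

$$\mathsf{E}\,|u_{i-1}\tilde w_{i-1}|^2 = (1-\mu)^2\,\mathsf{E}\,|e_a(i-1)|^2 + \mu^2\sigma_v^2$$

$$\mathsf{E}\,|u_{i-2}\tilde w_{i-1}|^2 = (1-\mu)^4\,\mathsf{E}\,|e_a(i-2)|^2 + (1-\mu)^2\mu^2\sigma_v^2 + \mu^2\sigma_v^2$$

Ignore any dependency between $\tilde w_i$ and v(i − 1).


(c) Use similar arguments to approximate the variances of the other entries of $e_{a,i}$. Specifically,
use the steady-state condition

$$\mathsf{E}\,|e_a(i)|^2 = \mathsf{E}\,|e_a(i-1)|^2 = \ldots = \mathsf{E}\,|e_a(i-K+1)|^2, \qquad \text{as } i \to \infty$$

and neglect the off-diagonal terms of $\mathsf{E}\,e_{a,i}e_{a,i}^*$, to argue that

$$\mathsf{E}\,e_{a,i}e_{a,i}^* \approx \mathsf{E}\,|e_a(i)|^2\,D_1 + \mu^2\sigma_v^2\,D_2, \qquad \text{as } i \to \infty$$

where the diagonal matrices $\{D_1, D_2\}$ are defined by

$$D_1 = \mathrm{diag}\left\{1,\ (1-\mu)^2,\ \ldots,\ (1-\mu)^{2(K-1)}\right\}$$

$$D_2 = \mathrm{diag}\left\{0,\ 1,\ 1 + (1-\mu)^2,\ \ldots,\ 1 + (1-\mu)^2 + \ldots + (1-\mu)^{2(K-2)}\right\}$$

Note that when μ is close to 1 and when the noise variance is relatively small, $D_1 \approx b_0b_0^{\mathsf T}$ and
$\sigma_v^2D_2 \approx 0$. Likewise, when μ is relatively small, we get $D_1 \approx I$.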


Problem IV.9 (Filters with data nonlinearities) Consider adaptive filters with update equations
of the form

$$w_i = w_{i-1} + \mu\,g[u_i]\,u_i^*e(i), \qquad w_{-1} = \text{initial condition}$$

for some positive scalar-valued function g[·] of the regression data, i.e., $g[u_i] > 0$. Verify that the
energy conservation relation of Thm. 15.1 still applies. Verify further that $\{e_p(i), e_a(i), e(i)\}$ are
related via $e_p(i) = e_a(i) - \mu\|u_i\|^2g[u_i]e(i)$.


Problem IV.10 (Price's theorem) Let a and b be scalar real-valued zero-mean jointly Gaussian
random variables and denote their correlation by $\rho = \mathsf{E}\,ab$. Price's theorem states that, for any
function f of (a, b) (for which the required derivatives and integrals exist), the following equality
holds (Price (1958)):

$$\frac{\partial^n\,\mathsf{E}\,f(a,b)}{\partial\rho^n} = \mathsf{E}\left(\frac{\partial^{2n}f(a,b)}{\partial a^n\,\partial b^n}\right)$$

in terms of the n-th and 2n-th order partial derivatives. [In simple terms, Price's theorem allows us
to move the expectation on the left-hand side outside of the differentiation operation.]


(a) Choose n = 1 and assume f(a, b) has the form f(a, b) = ag(b). Verify from Price's
theorem that $\partial\,\mathsf{E}\,ag(b)/\partial\rho = \mathsf{E}\,dg/db$, in terms of the derivative of g(·). Integrate both sides
over ρ to establish that $\mathsf{E}\,ag(b) = (\mathsf{E}\,ab)\cdot\mathsf{E}\,dg/db$.


(b) Show further that $\mathsf{E}\,bg(b) = \sigma_b^2\,\mathsf{E}\,dg/db$ and conclude that the following relation also holds:

$$\mathsf{E}\,ag(b) = \frac{\mathsf{E}\,ab}{\sigma_b^2}\cdot\mathsf{E}\,bg(b)$$


(c) Assume g(b) = sign(b). Conclude from part (b) that

$$\mathsf{E}\,a\,\mathrm{sign}(b) = \sqrt{\frac{2}{\pi}}\,\frac{1}{\sigma_b}\,\mathsf{E}\,ab$$

Problem IV.11 (Price's theorem for complex sign function) Let u and e denote two jointly-
Gaussian and complex-valued random variables, where e is scalar-valued and has variance $\sigma_e^2$. Let
$e = e_r + je_i$, $u = u_r + ju_i$, and assume that


1) The real parts of u and e are jointly Gaussian.

2) The imaginary parts of u and e are jointly Gaussian.

3) The real and imaginary parts of e have identical variances.

4) The real parts of {u, e} are independent of their imaginary parts.


The third condition means that $\mathsf{E}\,e_r^2 = \mathsf{E}\,e_i^2$ so that $\sigma_e^2 = 2\,\mathsf{E}\,e_r^2$. Using the definition of the sign
function for a general complex number from Prob. III.15, namely, $\mathrm{csgn}(e) = \mathrm{sign}(e_r) + j\,\mathrm{sign}(e_i)$,
verify that $\mathsf{E}\,\mathrm{Re}\{u^*\mathrm{csgn}(e)\} = \mathsf{E}\,[u_r\mathrm{sign}(e_r) + u_i\mathrm{sign}(e_i)]$. Use Price's theorem for real-valued
data from Prob. III.15 to conclude that

$$\mathsf{E}\,\mathrm{Re}\{u^*\mathrm{csgn}(e)\} = \frac{2}{\sqrt{\pi}}\,\frac{1}{\sigma_e}\,\mathsf{E}\,\mathrm{Re}\{u^*e\}$$


Problem IV.12 (Bussgang's theorem) Bussgang's theorem is a special case of Price's theorem.
Let {a, b} be two real zero-mean Gaussian random variables and define the function

$$g(b) = \int_0^b e^{-z^2/2\sigma^2}\,dz$$

for some σ > 0. Bussgang's theorem states that

$$\mathsf{E}\,ag(b) = (\mathsf{E}\,ab)\cdot\mathsf{E}\left(e^{-b^2/2\sigma^2}\right)$$

The proof of the theorem is as follows. Let $\rho = \mathsf{E}\,ab$. Use Price's general statement from Prob. IV.10
to verify that

$$\frac{\partial\,\mathsf{E}\,ag(b)}{\partial\rho} = \mathsf{E}\left(\frac{\partial^2\,ag(b)}{\partial a\,\partial b}\right) = \mathsf{E}\left(e^{-b^2/2\sigma^2}\right)$$

Integrate both sides of this result over ρ to establish Bussgang's theorem (Bussgang (1952)).


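Problem IV.13 (Useful identity) Consider two real-valued zero-mean jointly Gaussian random
variables {x, y} with covariance matrix

$$\mathsf{E}\begin{bmatrix} x \\ y \end{bmatrix}\begin{bmatrix} x & y \end{bmatrix} = \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}$$

That is, {x, y} have unit variances and correlation ρ. Define the function

$$g(x,y) = \frac{2}{\pi\sigma^2}\int_0^x\!\!\int_0^y e^{-\alpha^2/2\sigma^2}e^{-\beta^2/2\sigma^2}\,d\alpha\,d\beta$$

for some σ > 0.

(a) Verify that $\partial^2g(x,y)/\partial x\,\partial y = \frac{2}{\pi\sigma^2}e^{-x^2/2\sigma^2}e^{-y^2/2\sigma^2}$ and show that

$$\mathsf{E}\left[\frac{\partial^2g(x,y)}{\partial x\,\partial y}\right] = \frac{2}{\pi}\,\frac{1}{\sqrt{(\sigma^2+1)^2 - \rho^2}}$$


(b) Integrate the equality of part (a) over ρ ∈ (0, 1) and conclude that

$$\int_0^1\mathsf{E}\left[\frac{\partial^2g(x,y)}{\partial x\,\partial y}\right]d\rho = \frac{2}{\pi}\arcsin\left(\frac{1}{\sigma^2+1}\right)$$


(c) Use Price's identity (cf. Prob. IV.10) to conclude that

$$\mathsf{E}\,g(x,y) = \frac{2}{\pi}\arcsin\left(\frac{1}{\sigma^2+1}\right)$$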


Problem IV.14 (Performance of constrained LMS) Refer to the constrained LMS filter studied
in Prob. III.28, namely,

$$w_i = w_{i-1} + \mu Pu_i^*[d(i) - u_iw_{i-1}]$$

where the matrix P is defined by $P = I - cc^*/\|c\|^2$ for some known M × 1 vector c. Moreover, the
initial condition $w_{-1}$ is such that $c^*w_{-1} = \alpha$, where α is a known real scalar. Assume that the data
$\{d(i), u_i\}$ satisfy the stationary model (15.16) with the additional requirement that the unknown
model $w^o$ is such that $c^*w^o = \alpha$ as well.


(a) Introduce the weighted a priori and a posteriori errors $e_a^P(i) = u_iP\tilde w_{i-1}$ and
$e_p^P(i) = u_iP\tilde w_i$, where $\tilde w_i = w^o - w_i$. Verify that $e_p^P(i) = e_a^P(i) - \mu\|u_i\|_P^2\,e(i)$, where
$e(i) = d(i) - u_iw_{i-1}$ and the notation $\|u_i\|_P^2$ stands for $u_iPu_i^*$.

(b) Follow the arguments of Sec. 15.3 to show that the following energy relation holds:

$$\|\tilde w_i\|^2 + \frac{1}{\|u_i\|_P^2}\,|e_a^P(i)|^2 = \|\tilde w_{i-1}\|^2 + \frac{1}{\|u_i\|_P^2}\,|e_p^P(i)|^2$$


(c) Verify that $e(i) = e_a(i) + v(i)$, where $e_a(i) = u_i\tilde w_{i-1}$. Use the steady-state condition
$\mathsf{E}\,\|\tilde w_i\|^2 = \mathsf{E}\,\|\tilde w_{i-1}\|^2$ as $i \to \infty$, to conclude that the above energy relation leads to the
following variance relation

$$\mu\,\mathsf{E}\,\|u_i\|_P^2\,|e(i)|^2 = 2\,\mathrm{Re}\left[\mathsf{E}\,e_a^{P*}(i)e(i)\right], \qquad \text{as } i \to \infty$$

Use the fact that $c^*\tilde w_{i-1} = 0$ to verify that the variance relation reduces to the following

$$\mu\,\mathsf{E}\,\|u_i\|_P^2\,|e_a(i)|^2 + \mu\sigma_v^2\,\mathrm{Tr}(PR_u) = 2\,\mathsf{E}\,|e_a(i)|^2, \qquad \text{as } i \to \infty$$


(d) Argue that for sufficiently small step-sizes, the EMSE of constrained LMS can be approximated
by $\zeta^{\rm constrained\text{-}LMS} = \mu\sigma_v^2\,\mathrm{Tr}(PR_u)/2$.


(e) Use instead the separation assumption (16.7) to conclude that the EMSE can also be approximated
by $\zeta^{\rm constrained\text{-}LMS} = \mu\sigma_v^2\,\mathrm{Tr}(PR_u)/(2 - \mu\,\mathrm{Tr}(PR_u))$.
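
The role of the projection P in this recursion can be seen in a short simulation. The sketch below is written for real data (so that the conjugate transpose becomes a plain transpose) and uses hypothetical values for c, α, the model `w_o`, and the noise level; it checks that the constraint c^T w_i = α is preserved at every iteration because P annihilates c.

```python
import random

def constrained_lms(c, alpha, w_o, mu=0.05, num_iter=500, seed=0):
    """Run w_i = w_{i-1} + mu * P u_i^T [d(i) - u_i w_{i-1}] for real data."""
    rng = random.Random(seed)
    M = len(c)
    c_norm2 = sum(ck * ck for ck in c)
    w = [alpha * ck / c_norm2 for ck in c]     # initial condition with c^T w = alpha
    for _ in range(num_iter):
        u = [rng.gauss(0.0, 1.0) for _ in range(M)]
        d = sum(uk * wk for uk, wk in zip(u, w_o)) + rng.gauss(0.0, 0.1)
        e = d - sum(uk * wk for uk, wk in zip(u, w))
        # P u^T = u^T - c (c^T u^T)/||c||^2, using P = I - c c^T/||c||^2
        cu = sum(ck * uk for ck, uk in zip(c, u)) / c_norm2
        Pu = [uk - ck * cu for uk, ck in zip(u, c)]
        w = [wk + mu * pk * e for wk, pk in zip(w, Pu)]
    return w

c = [1.0, 2.0, -1.0]
alpha = 3.0
w_o = [1.0, 1.5, 1.0]        # hypothetical model satisfying c^T w_o = 3
w = constrained_lms(c, alpha, w_o)
print(sum(ck * wk for ck, wk in zip(c, w)))    # remains at alpha up to rounding
print(w)
```

The final weights should also lie close to `w_o`, since both the model and the iterates satisfy the same constraint.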


Problem IV.15 (Performance of RLS by independence) Refer to the RLS algorithm studied
in Chapter 19, namely, (19.2)-(19.1) with initial condition $w_{-1} = 0$. Assume further that the data
$\{d(i), u_i\}$ satisfy model (15.16).


(a) Show that $w_i$ satisfies the equation $w_i = P_is_i$, where $s_i = \sum_{j=0}^{i}\lambda^{i-j}u_j^*d(j)$. Remark.
The equations $w_i = P_is_i$ will be encountered later in (30.21) when RLS is motivated and derived as
the exact recursive solution to a least-squares problem. At that point, we will refer to $w_i = P_is_i$ as the
normal equations.

(b) Use the data model (15.16) to show that 


i 
: 3 o " . i-j * a 
ME SM + ds (e nj) 


j-0 
paewai = sm E |P (Zujus) n] 
j-0 


(c) As in (19.11) and (19.12), replace $P_i$ by $(1-\lambda)R_u^{-1}$ and $\left(\sum_{j=0}^{i}\lambda^{2(i-j)}u_j^*u_j\right)$ by
$R_u/(1-\lambda^2)$. Conclude that $\lim_{i\to\infty}\mathsf{E}\,\tilde w_i\tilde w_i^* = \sigma_v^2(1-\lambda)R_u^{-1}/(1+\lambda)$.


(d) Use the independence condition (16.12) to conclude that the EMSE of RLS is given by
$\mathrm{EMSE}^{\rm RLS} = \sigma_v^2(1-\lambda)M/(1+\lambda)$. Compare this expression with the result of Lemma 19.1.


Problem IV.16 (Performance of CMA2-2 for real-data) In this problem we evaluate the mean-
square performance of the CMA2-2 recursion,

$$w_i = w_{i-1} + \mu u_i^{\mathsf T}z(i)[\gamma - z^2(i)], \qquad z(i) = u_iw_{i-1}, \quad i \ge 0$$

where μ is a positive step-size and γ is a positive scalar (see Alg. 12.6). The main issue that arises
while studying CMA2-2, in contrast to the adaptive filters studied in the body of the chapter, is the
absence of a reference sequence d(i) and, correspondingly, of an explicit weight vector $w^o$ relative
to which we can define the weight error vector, $\tilde w_i = w^o - w_i$. This issue can be handled as follows.

CMA2-2 is usually used in the context of channel equalization in communications; see Prob. III.4.
Thus let {s(i)} denote symbols from a constellation that are transmitted over a communications
channel. The received data (i.e., the output of the channel) are denoted by u(i) and are fed into an
FIR equalizer that is trained by CMA2-2. The purpose of the equalizer is to reproduce the transmitted
data, say to recover {s(i − Δ)} for some delay Δ.

Now assume that the equalization problem is such that there exists a receiver $w^o$ that is able to
reproduce the transmitted data {s(i)} with some delay Δ, i.e., such that $s(i-\Delta) = u_iw^o$. The
task of CMA2-2 then becomes that of attempting to estimate $w^o$ so that the output of the equalizer,
$z(i) = u_iw_{i-1}$, would tend to the desired symbol s(i − Δ). The mean-square performance of
CMA2-2 is then measured in terms of how successful it is in achieving this objective, i.e., in terms
of the steady-state value of $\mathsf{E}\,|s(i-\Delta) - u_iw_{i-1}|^2$.

In this problem we focus on the case of real-valued data {s(i)}, e.g., data from a PAM constellation.
Problem IV.17 extends the results to complex-valued constellations. We may add that for
CMA2-2, one choice for the scalar γ is as $\gamma = \mathsf{E}|s(i)|^4/\mathsf{E}|s(i)|^2$. The analysis below, however, is
for general γ.


(a) Verify that the energy-conservation relation of Thm. 15.1 still holds for CMA2-2, namely, that
for any data $\{u_i\}$, $\|\tilde w_i\|^2 + \bar\mu(i)|e_a(i)|^2 = \|\tilde w_{i-1}\|^2 + \bar\mu(i)|e_p(i)|^2$, where $e_a(i) = u_i\tilde w_{i-1}$,
$e_p(i) = u_i\tilde w_i$, $\tilde w_i = w^o - w_i$, and $\bar\mu(i)$ is defined as in (15.31).

(b) Verify also that the variance relation of Thm. 15.2 still holds, namely, for any data $\{u_i\}$,
$\mu\,\mathsf{E}\,\|u_i\|^2g^2[z(i)] = 2\,\mathsf{E}\,e_a(i)g[z(i)]$ as $i \to \infty$. Using $g[z] = z[\gamma - z^2]$, show that this
relation reduces to

$$\mu\,\mathsf{E}\left(\|u_i\|^2z^2(i)[\gamma - z^2(i)]^2\right) = 2\,\mathsf{E}\left(e_a(i)z(i)[\gamma - z^2(i)]\right)$$

(c) Introduce the following assumptions:
1. The transmitted signal s(i − Δ) and the estimation error $e_a(i)$ are independent
in steady-state, so that $\mathsf{E}\,s(i-\Delta)e_a(i) = 0$ since s(i − Δ) is assumed zero mean.
2. At steady-state, $\|u_i\|^2$ is independent of z(i).
This condition extends the separation condition (16.7) to the CMA case.

Using assumptions 1. and 2., and replacing z(i) by $s(i-\Delta) - e_a(i)$, verify that

$$\mu\,\mathsf{E}\left(\|u_i\|^2z^2(i)[\gamma - z^2(i)]^2\right) = \mu\,\mathsf{E}\left(\gamma^2s^2 - 2\gamma s^4 + s^6\right)\cdot\mathsf{E}\,\|u\|^2 + A$$

$$2\,\mathsf{E}\left(e_a(i)z(i)[\gamma - z^2(i)]\right) = 2\,\mathsf{E}\left(-\gamma e_a^2 + 3s^2e_a^2 + e_a^4\right)$$

where $A = \mu\,\mathsf{E}\left[(\gamma^2 - 12\gamma s^2 + 15s^4)e_a^2\right]\mathsf{E}\,\|u\|^2 + \mu\,\mathsf{E}\left[15s^2e_a^4 + e_a^6 - 2\gamma e_a^4\right]\mathsf{E}\,\|u\|^2$. In
the above expressions, we dropped the time index from $\{s(i-\Delta), e_a(i)\}$ for compactness of
notation. Assume μ and $e_a^2$ are small enough so as to ignore the terms $\mathsf{E}\,e_a^4$ and A. Conclude
that the MSE of CMA2-2 is given by

$$\mathrm{MSE}^{\rm CMA2\text{-}2} \approx \mu\,\frac{\mathsf{E}\left(\gamma^2s^2 - 2\gamma s^4 + s^6\right)}{2\,\mathsf{E}\left(3s^2 - \gamma\right)}\,\mathrm{Tr}(R_u)$$

in terms of the second, fourth, and sixth moments of the constellation.

Remark. The results of this problem, and those of Probs. IV.17-IV.18, are based on the work by Mai and Sayed
(2000).

Problem IV.17 (Performance of CMA2-2 for complex-data) In this problem we extend the
analysis of Prob. IV.16 to complex-valued constellations, in which case CMA2-2 is given by

$$w_i = w_{i-1} + \mu u_i^*z(i)[\gamma - |z(i)|^2], \qquad z(i) = u_iw_{i-1}, \quad i \ge 0$$

We now assume, for generality, that there exists an equalizer $w^o$ that reproduces the complex data
{s(i)} up to some rotation θ, i.e., $u_iw^o = s(i-\Delta)e^{j\theta}$. The purpose of the mean-square performance
analysis is to evaluate how well the output of the CMA2-2 implementation, namely,
$z(i) = u_iw_{i-1}$, can approximate $s(i-\Delta)e^{j\theta}$.


(a) Verify that the variance relation of Thm. 15.2 still holds, namely, for any data $\{u_i\}$,

$$\mu\,\mathsf{E}\,\|u_i\|^2|g[z(i)]|^2 = 2\,\mathrm{Re}\left(\mathsf{E}\,e_a^*(i)g[z(i)]\right), \qquad \text{as } i \to \infty$$


(b) Now introduce the following assumptions:
1. The transmitted signal s(i − Δ) and the estimation error $e_a(i)$ are independent in
steady-state, so that $\mathsf{E}\,s^*(i-\Delta)e_a(i) = 0$ since s(i − Δ) is assumed zero mean.
2. At steady-state, $\|u_i\|^2$ is independent of z(i).
3. The data s(i) is circular, $\mathsf{E}\,s^2(i) = 0$.
4. The scalar γ satisfies $\mathsf{E}\,(2|s(i)|^2 - \gamma) > 0$.
Repeat the derivation of part (c) of Prob. IV.16 to show that the MSE of CMA2-2 is now given
by

$$\mathrm{MSE}^{\rm CMA2\text{-}2} \approx \mu\,\frac{\mathsf{E}\left(\gamma^2|s|^2 - 2\gamma|s|^4 + |s|^6\right)}{2\,\mathsf{E}\left(2|s|^2 - \gamma\right)}\,\mathrm{Tr}(R_u)$$

in terms of the second, fourth, and sixth moments of the constellation.


(c) Assume the data {s(i)} have constant-modulus, |s(i)| = 1, and choose γ = 1. Verify that in


Problem IV.18 (Performance of CMA1-2 for complex-data) In this problem we extend the
analysis of Probs. IV.16 and IV.17 to CMA1-2, which is described by the following recursion (see
Alg. 12.5):

$$w_i = w_{i-1} + \mu u_i^*\left[\gamma\frac{z(i)}{|z(i)|} - z(i)\right], \qquad z(i) = u_iw_{i-1}, \quad i \ge 0$$

We also assume, for generality, that there exists an equalizer $w^o$ that reproduces the complex data
{s(i)} up to some rotation θ and delay Δ, i.e., $u_iw^o = s(i-\Delta)e^{j\theta}$. For CMA1-2, one choice for
the scalar γ is $\gamma = \mathsf{E}\,|s(i)|^2/\mathsf{E}\,|s(i)|$. The results below, however, are for general γ. Compared with
CMA2-2, the function g[z] is now given by $g[z] = (\gamma z/|z|) - z$. In addition to the assumptions in
part (b) of Prob. IV.17, introduce the following:
1. The output z(i) is distributed symmetrically around the transmitted signal s(i − Δ) in
steady-state, so that $\mathsf{E}\,|z(i)| = \mathsf{E}\,|s(i-\Delta)|$.
2. The estimation error $e_a(i)$ is independent of z(i)/|z(i)| in steady-state and $\mathsf{E}\,z(i)/|z(i)| = 0$
so that $\mathsf{E}\,e_a^*(i)z(i)/|z(i)| = 0$.


(a) Consider again the variance relation $\mu\,\mathsf{E}\,\|u_i\|^2|g[z(i)]|^2 = 2\,\mathrm{Re}\left(\mathsf{E}\,e_a^*(i)g[z(i)]\right)$, as $i \to \infty$.
Verify that it reduces to

$$\mu\,\mathsf{E}\,\|u\|^2\cdot\mathsf{E}\left(\gamma^2 - 2\gamma|z| + |z|^2\right) = \mathsf{E}\left(e_a^*\left(\gamma\frac{z}{|z|} - z\right) + e_a\left(\gamma\frac{z^*}{|z|} - z^*\right)\right)$$


(b) Replacing $\mathsf{E}\,|s| = \mathsf{E}\,|z|$ and $\mathsf{E}\,|z|^2 = \mathsf{E}\,|s|^2 + \mathsf{E}\,|e_a|^2$, and ignoring the term $\mu\,\mathsf{E}\,|e_a|^2\|u\|^2$
when μ and $e_a(i)$ are small, conclude that the MSE of CMA1-2 is given by

$$\mathrm{MSE}^{\rm CMA1\text{-}2} \approx \frac{\mu}{2}\left(\gamma^2 + \mathsf{E}\,|s|^2 - 2\gamma\,\mathsf{E}\,|s|\right)\cdot\mathrm{Tr}(R_u)$$


Problem IV.19 (Correlated Gaussian regressors) Consider the LMS recursion (16.1) and assume
the data $\{d(i), u_i\}$ satisfy model (15.16). Assume further that the steady-state condition
(16.12) holds. In this problem we reconsider the performance of LMS for Gaussian regressors,
as was done in Sec. 16.4, except that now we do not restrict the covariance matrix to be $R_u = \sigma_u^2I$.
Introduce the eigen-decomposition $R_u = U\Lambda U^*$, where Λ is a diagonal matrix with the eigenvalues
of $R_u$, $\Lambda = \mathrm{diag}\{\lambda_k\}$, and U is a unitary matrix (i.e., it satisfies $UU^* = U^*U = I$). Define the
transformed quantities $\bar w_i = U^*\tilde w_i$ and $\bar u_i = u_iU$.

(a) Argue that $\bar u_i$ is circular Gaussian with covariance matrix $\mathsf{E}\,\bar u_i^*\bar u_i = \Lambda$. Verify also that
$e_a(i) = \bar u_i\bar w_{i-1}$ and $\mathsf{E}\,\|u_i\|^2|e_a(i)|^2 = \mathsf{E}\,\|\bar u_i\|^2|e_a(i)|^2$. Let $\bar C_{i-1} = \mathsf{E}\,\bar w_{i-1}\bar w_{i-1}^*$.

(b) Use the result of Lemma A.3, and the derivation that led to (16.16), to show that

$$\mathsf{E}\,\|u_i\|^2|e_a(i)|^2 = \mathrm{Tr}(\Lambda)\,\mathrm{Tr}(\bar C_{i-1}\Lambda) + \mathrm{Tr}(\Lambda\bar C_{i-1}\Lambda)$$

Likewise, verify that $\mathsf{E}\,|e_a(i)|^2 = \mathrm{Tr}(\Lambda\bar C_{i-1})$.

(c) Conclude that, in steady-state, the variance relation (16.5) becomes

$$\zeta^{\rm LMS} = \frac{\mu}{2}\left[\mathrm{Tr}(\Lambda)\,\mathrm{Tr}(\bar C_\infty\Lambda) + \mathrm{Tr}(\Lambda\bar C_\infty\Lambda) + \sigma_v^2\,\mathrm{Tr}(\Lambda)\right]$$

Remark 1. Thus observe that in this case, for Gaussian regressors with arbitrary covariance matrix $R_u$,
the factor $\bar C_\infty$ does not disappear from the EMSE expression; see Sec. 23.1 and the EMSE expression
in Thm. 23.3 for more details on this case.

Remark 2. Observe further that the procedure for finding the filter EMSE through (15.40), as discussed in
the body of the chapter, avoids the need to explicitly evaluate $\mathsf{E}\,\|\tilde w_i\|^2$ or its steady-state value. This is in
contrast to the procedure of this problem, which requires determining the steady-state covariance matrix
of $\bar w_i$, before arriving at an expression for the EMSE of the filter.


Problem IV.20 (Regressor transformation) Refer to the discussion in App. 17.A on relating
ε-NLMS to LMS and consider, in particular, the transformed regressor $\check u_i = u_i/\sqrt{\epsilon + \|u_i\|^2}$. Set
ε = 0 for simplicity in this problem and assume M = 1 (i.e., $u_i$ is a scalar). Assume further that the
distribution of $u_i$ is not symmetric, say $u_i = 1$ with probability 2/3 and $u_i = -2$ with probability
1/3. Verify that $u_i$ is zero-mean while $\check u_i$ is not.
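
The two means can be computed exactly with rational arithmetic. The short sketch below assumes the scalar case with ε = 0, for which the transformed regressor reduces to u/|u|; the variable names are chosen for this illustration only.

```python
from fractions import Fraction

# Values of the scalar regressor and their probabilities.
distribution = {Fraction(1): Fraction(2, 3), Fraction(-2): Fraction(1, 3)}

mean_u = sum(u * p for u, p in distribution.items())
# With M = 1 and epsilon = 0, the transformed regressor is u / |u|.
mean_u_normalized = sum((u / abs(u)) * p for u, p in distribution.items())

print(mean_u, mean_u_normalized)
```

Normalization maps the two values to +1 and −1 while leaving their probabilities unchanged, which is why the mean is no longer zero.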


Problem IV.21 (Lossless mapping) Refer to equation (15.47) and consider the mapping

$$T = \begin{bmatrix} I - \bar\mu(i)u_i^*u_i & \sqrt{\bar\mu(i)}\,u_i^* \\ \sqrt{\bar\mu(i)}\,u_i & 0 \end{bmatrix}$$

Verify that T is Hermitian and unitary, i.e., $T^2 = I$.
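
A numerical check can complement the algebraic verification. The sketch below assumes real data and assumes that the mapping has the block form T = [[I − μ̄ uᵀu, √μ̄ uᵀ], [√μ̄ u, 0]] with μ̄ = 1/‖u‖²; this block structure is an assumption of the illustration, since equation (15.47) is not reproduced here. The helper names `build_T` and `matmul` are hypothetical.

```python
def build_T(u):
    """Assumed block form of T for a real row vector u, with mubar = 1/||u||^2."""
    M = len(u)
    mubar = 1.0 / sum(x * x for x in u)
    root = mubar ** 0.5
    T = [[(1.0 if m == n else 0.0) - mubar * u[m] * u[n] for n in range(M)]
         + [root * u[m]] for m in range(M)]       # top block row
    T.append([root * x for x in u] + [0.0])       # bottom block row
    return T

def matmul(A, B):
    """Plain matrix product of two lists of lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

T = build_T([1.0, -2.0, 0.5])
T2 = matmul(T, T)
for row in T2:
    print([round(x, 12) for x in row])
```

Under these assumptions T is symmetric by construction, and the printed product should be the identity matrix up to rounding.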


Problem IV.22 (Optimal step-size for LMS) Show that the expression for the optimal LMS step-size in (21.10) is the
unique minimum of (21.9) over the interval $0 < \mu < 2/\mathrm{Tr}(R_u)$.


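Problem IV.23 (Optimal step-sizes for ε-NLMS) Refer to the discussion in Sec. 21.2 on the
tracking performance of ε-NLMS.


(a) Verify by differentiating (21.18) with respect to μ that

$$\mu_{\rm opt}^{\epsilon\text{-NLMS}} = \sqrt{\frac{\mathrm{Tr}(Q)}{\alpha_u\sigma_v^2} + \frac{(\mathrm{Tr}(Q))^2}{4\eta_u^2\sigma_v^4}} - \frac{\mathrm{Tr}(Q)}{2\eta_u\sigma_v^2}$$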


(b) Differentiate instead (21.20) with respect to μ to obtain


uns 1(] T(Q) TQ) _ TQ) 
A -z: E G/A leet | HED) 


Verify that this result agrees with the expression of part (a) when $\eta_u$ and $\alpha_u$ are replaced by
$\mathsf{E}\,(1/\|u_i\|^2)$. Assume further that the step-size is small enough so that $2 - \mu \approx 2$. Justify the
expressions
c-NLMS _, 1 2 u ^ Tr(Q) ) 
Ral boy + oa 
; à + Fare) 
and 
«— NLMS Tr(Q) : &— NLMS e;Tr(Q) 
mab—— c. with NMS y a LÁ 
des oe |ui) Se E Gel 


(c) Differentiate (21.21) with respect to $\mu$ and verify that the results of part (b) are still valid, and
that the corresponding minimum EMSE for small $\mu$ is now given by


e—NLMS 1 
min a Tr(R,)4[ o2Tr(Q)E Gar) 
( m (Q) TAE 


Problem IV.24 (ε-NLMS with power normalization) The purpose of this problem is to extend
the derivation of Sec. 21.2 to the ε-NLMS recursion with power normalization (cf. (21.50)-(21.51)),
where the regression vector is assumed to have shift-structure, say

$$ u_i = [\,u(i)\;\; u(i-1)\;\; u(i-2)\;\; \ldots\;\; u(i-M+1)\,] $$

with leading entry $u(i)$. Assume the regression data is circular Gaussian with covariance matrix
$R_u = \sigma_u^2 I$. Follow the arguments that led to (21.18), and the simplifications of Sec. 17.1, to conclude
that


Cen PNLMS uM(i+ Bop + u^ ez (1 — 8)Tr(Q) 


(Gaussian regressors) 


2y(1 — 8) - »M(1 + 8) 
where y = 3 for real data and y = 2 for complex data. 


Problem IV.25 (Tracking performance of LMMN and LMF) Consider the LMMN algorithm given
by (21.52). The least-mean fourth (LMF) algorithm corresponds to the special case $\delta = 0$. Introduce
the symbols $\bar{\delta} = 1 - \delta$, $\xi_v^4 = E\,|v(i)|^4$, and $\xi_v^6 = E\,|v(i)|^6$, where the scalars $\{\xi_v^4, \xi_v^6\}$ denote
the fourth and sixth-order moments of $v(i)$, respectively. The data $\{d(i), u_i, v(i)\}$ are assumed to
satisfy model (20.16).


(a) Assume first that all data are real-valued. Repeat the arguments of Prob. IV.6 to arrive at the 
equality 


2bEe2() = po Tr(Q) + paTr(Ru) + ucE [uil eal) 
+ 855 (E ||u;| es (i)) Ev? (i) + 6n? (E lju: es())) Ev?(i), i — oc 


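where $a = \delta^2\sigma_v^2 + 2\delta\bar{\delta}\xi_v^4 + \bar{\delta}^2\xi_v^6$, $b = \delta + 3\bar{\delta}\sigma_v^2$, and $c = \delta^2 + 12\delta\bar{\delta}\sigma_v^2 + 15\bar{\delta}^2\xi_v^4$. In order
to simplify this result, we may consider three cases: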


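(a.1) Sufficiently small step-sizes. Argue that when $\mu$ is sufficiently small we get

$$ \zeta^{\rm LMMN} = \frac{\mu a\,\mathrm{Tr}(R_u) + \mu^{-1}\mathrm{Tr}(Q)}{2b}, \qquad \zeta^{\rm LMF} = \frac{\mu \xi_v^6\,\mathrm{Tr}(R_u) + \mu^{-1}\mathrm{Tr}(Q)}{6\sigma_v^2} $$

Differentiate these expressions with respect to $\mu$ to conclude that

$$ \mu_{\rm opt}^{\rm LMMN} = \sqrt{\frac{\mathrm{Tr}(Q)}{a\,\mathrm{Tr}(R_u)}}, \qquad \mu_{\rm opt}^{\rm LMF} = \sqrt{\frac{\mathrm{Tr}(Q)}{\xi_v^6\,\mathrm{Tr}(R_u)}} $$

and

$$ \zeta_{\min}^{\rm LMMN} = \frac{\sqrt{a\,\mathrm{Tr}(Q)\,\mathrm{Tr}(R_u)}}{b}, \qquad \zeta_{\min}^{\rm LMF} = \frac{\sqrt{\xi_v^6\,\mathrm{Tr}(Q)\,\mathrm{Tr}(R_u)}}{3\sigma_v^2} $$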


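(a.2) Separation principle. For larger values of $\mu$, use the separation assumption (21.8) and
the steady-state assumption (16.12) to conclude that

$$ \zeta^{\rm LMMN} = \frac{\mu a\,\mathrm{Tr}(R_u) + \mu^{-1}\mathrm{Tr}(Q)}{2b - \mu c\,\mathrm{Tr}(R_u)}, \qquad \zeta^{\rm LMF} = \frac{\mu \xi_v^6\,\mathrm{Tr}(R_u) + \mu^{-1}\mathrm{Tr}(Q)}{6\sigma_v^2 - 15\mu\xi_v^4\,\mathrm{Tr}(R_u)} $$

Verify that

$$ \mu_{\rm opt}^{\rm LMMN} = \sqrt{\left(\frac{c\,\mathrm{Tr}(Q)}{2ab}\right)^2 + \frac{\mathrm{Tr}(Q)}{a\,\mathrm{Tr}(R_u)}} \;-\; \frac{c\,\mathrm{Tr}(Q)}{2ab} $$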


(a.3) White Gaussian regressors. Assume that the regression vector u; is Gaussian with 
covariance matrix Ru = o?I, and ignore the terms in v?(i) and v?(i). Argue that in 
this case 


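$$ \zeta^{\rm LMMN} = \frac{\mu M\sigma_u^2 a + \mu^{-1}\mathrm{Tr}(Q)}{2b - \mu c(M+2)\sigma_u^2}, \qquad \zeta^{\rm LMF} = \frac{\mu M\sigma_u^2\xi_v^6 + \mu^{-1}\mathrm{Tr}(Q)}{6\sigma_v^2 - 15\mu(M+2)\sigma_u^2\xi_v^4} $$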


(b) Assume now that the data are complex-valued. Show that the same results of part (a) still hold
with the modifications $b = \delta + 2\bar{\delta}\sigma_v^2$ and $c = \delta^2 + 8\delta\bar{\delta}\sigma_v^2 + 9\bar{\delta}^2\xi_v^4$.


Problem IV.26 (Tracking of ε-APA) The purpose of this problem is to extend the energy-conservation
approach of Sec. 20.3 to evaluate the EMSE of the ε-APA algorithm of Sec. 13.1, namely, (21.53)-(21.55).
Assume the data satisfy model (20.16).


(a) Introduce the a priori and a posteriori error vectors $e_{a,i} = U_i(w_i^o - w_{i-1})$ and $e_{p,i} = U_i\tilde{w}_i$,
where $\tilde{w}_i = w_i^o - w_i$. Extend the argument of Prob. IV.7 to show that the result of Thm. 20.1
becomes

$$ \|w_i^o - w_i\|^2 + e_{a,i}^*(U_iU_i^*)^{-1}e_{a,i} = \|w_i^o - w_{i-1}\|^2 + e_{p,i}^*(U_iU_i^*)^{-1}e_{p,i} $$

Conclude from the argument that led to (20.29) that the above result reduces under expectation
to

$$ E\,\|\tilde{w}_i\|^2 + E\,e_{a,i}^*(U_iU_i^*)^{-1}e_{a,i} = E\,\|\tilde{w}_{i-1}\|^2 + E\,\|q_i\|^2 + E\,e_{p,i}^*(U_iU_i^*)^{-1}e_{p,i} $$


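(b) Introduce the matrices

$$ A_i = (\epsilon I + U_iU_i^*)^{-1}\,U_iU_i^*\,(\epsilon I + U_iU_i^*)^{-1}, \qquad B_i = (\epsilon I + U_iU_i^*)^{-1} $$

Ignore any dependency between $v_i$ and $e_{a,i}$ in steady-state. Argue that, as $i \to \infty$, the result
of part (a) leads to the variance relation

$$ \mu E\,[e_{a,i}^*A_ie_{a,i}] + \mu E\,[v_i^*A_iv_i] + \mu^{-1}E\,\|q_i\|^2 = 2E\,[e_{a,i}^*B_ie_{a,i}], \quad \text{as } i \to \infty $$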


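(c) Assume that, at steady-state, $U_i$ is independent of $e_{a,i}$ and $E\,e_{a,i}e_{a,i}^* = (E\,|e_a(i)|^2)\cdot S$,
where $S \approx I$ for small $\mu$ and $S \approx b_0b_0^{\sf T}$ for larger $\mu$. Here $b_0$ denotes the first basis vector,
$b_0 = \mathrm{col}\{1, 0, \ldots, 0\}$. Conclude that for small $\epsilon$, the filter EMSE can be approximated by
any of the following expressions:

$$ \zeta^{\rm APA} = \frac{1}{(2-\mu)\,\mathrm{Tr}(E\,A_i)}\left[\mu\sigma_v^2\,\mathrm{Tr}(E\,A_i) + \mu^{-1}\mathrm{Tr}(Q)\right] $$

$$ \zeta^{\rm APA} = \frac{1}{(2-\mu)\,E\,A_i(0,0)}\left[\mu\sigma_v^2\,\mathrm{Tr}(E\,A_i) + \mu^{-1}\mathrm{Tr}(Q)\right] $$

where $A_i \approx (U_iU_i^*)^{-1}$ and $A_i(0,0)$ denotes the $(0,0)$ entry of $A_i$.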


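Problem IV.27 (Tracking of ε-APA) Consider the family of affine projection algorithms described
by (13.14), for some integers $\{\alpha, D\}$, namely,

$$ w_i = w_{i-1-\alpha(K-1)} + \mu U_i^*(\epsilon I + U_iU_i^*)^{-1}\left[d_i - U_iw_{i-1-\alpha(K-1)}\right] $$

where now $\{U_i, d_i\}$ are replaced by

$$ U_i = \begin{bmatrix} u_i \\ u_{i-D} \\ \vdots \\ u_{i-(K-1)D} \end{bmatrix}, \qquad d_i = \begin{bmatrix} d(i) \\ d(i-D) \\ \vdots \\ d(i-(K-1)D) \end{bmatrix} $$

Let $K' = \alpha(K-1)$ and define the a priori error vector as $e_{a,i} = U_i(w_i^o - w_{i-1-K'})$.


(a) Repeat the arguments of Prob. IV.26 to verify that the energy-conservation relation becomes

$$ \|w_i^o - w_i\|^2 + e_{a,i}^*(U_iU_i^*)^{-1}e_{a,i} = \|w_i^o - w_{i-1-K'}\|^2 + e_{p,i}^*(U_iU_i^*)^{-1}e_{p,i} $$

which, under expectation, gives

$$ E\,\|\tilde{w}_i\|^2 + E\,e_{a,i}^*(U_iU_i^*)^{-1}e_{a,i} = E\,\|\tilde{w}_{i-1-K'}\|^2 + (1+K')\,\mathrm{Tr}(Q) + E\,e_{p,i}^*(U_iU_i^*)^{-1}e_{p,i} $$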


(b) Show that (25, — uau )E Jea (i)|? = ue2Tr(E Ai) + n^ (1 -- K')Tr(Q) as i — oo, so that 
the excess mean-square error (EMSE) of the filter is now given by 


ÇAPA _ woe TE AQ ko T Core) TEO) 


(2n. -uau ) 


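Problem IV.28 (Trace inequalities) Consider a collection of $M$ positive scalars $\{\lambda_i\}$.


(a) For any two positive numbers $a$ and $b$, use the fact that $(a-b)^2 \geq 0$ to conclude that $a/b + b/a \geq 2$.
Now argue that the term $\left(\sum_{i=1}^M \lambda_i\right)\left(\sum_{i=1}^M 1/\lambda_i\right)$ consists of $M$ products of
$\lambda_i$ by $1/\lambda_i$ and $M(M-1)/2$ sums of the form $\lambda_i/\lambda_j + \lambda_j/\lambda_i$ for $i \neq j$. Conclude that

$$ \left(\sum_{i=1}^M \lambda_i\right)\left(\sum_{i=1}^M \frac{1}{\lambda_i}\right) \geq M^2 $$

(b) Use $\left(\sum_{i=1}^M \lambda_i\right)^2 = \sum_{i=1}^M \lambda_i^2 + \sum_{i<j} 2\lambda_i\lambda_j$ and the fact that $2\lambda_i\lambda_j \leq \lambda_i^2 + \lambda_j^2$,
to conclude that $\left(\sum_{i=1}^M \lambda_i\right)^2 \leq M\left(\sum_{i=1}^M \lambda_i^2\right)$.


(c) Let $R_u$ be any $M \times M$ Hermitian positive-definite matrix with eigenvalues $\{\lambda_i\}$. Use the results
of parts (a) and (b) to establish that $[\mathrm{Tr}(R_u)]^2 \leq M\,\mathrm{Tr}(R_u^2)$ and $M^2 \leq \mathrm{Tr}(R_u)\,\mathrm{Tr}(R_u^{-1})$.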
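
Both trace inequalities depend only on the eigenvalues, so they can be spot-checked numerically before attempting the proof. The sketch below draws random positive eigenvalues (the range, the count, and the relative tolerance `tol` are arbitrary choices made here) and also tests the equality case in which all eigenvalues coincide.

```python
import random

def trace_inequalities_hold(eigs, tol=1e-12):
    """Check [Tr(R)]^2 <= M Tr(R^2) and M^2 <= Tr(R) Tr(R^{-1}) from the eigenvalues."""
    M = len(eigs)
    tr = sum(eigs)                          # Tr(R)
    tr_sq = sum(l * l for l in eigs)        # Tr(R^2)
    tr_inv = sum(1.0 / l for l in eigs)     # Tr(R^{-1})
    first = tr ** 2 <= M * tr_sq * (1 + tol)
    second = M ** 2 <= tr * tr_inv * (1 + tol)
    return first, second

random.seed(0)
results = [
    trace_inequalities_hold([random.uniform(0.1, 10.0) for _ in range(6)])
    for _ in range(1000)
]
all_hold = all(a and b for a, b in results)

# Equality is attained in both bounds when all eigenvalues are identical.
equal_case = trace_inequalities_hold([2.0] * 6)
print(all_hold, equal_case)
```

A numerical check of this kind is not a proof, but it makes the role of equal eigenvalues (a scaled identity $R_u$) as the extremal case easy to see.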


Problem IV.29 (Model with frequency offset) Consider data $\{d(i), u_i\}$ that satisfy the linear
relation $d(i) = u_iw_i^oe^{j\Omega i} + v(i)$, where $v(i)$ denotes measurement noise and $\Omega$ models some
constant frequency offset ($\Omega$ could be zero as well). Assume further that $w_i^o$ varies according to the
auto-regressive model: $w_i^o = w^o + \theta_i$ and $\theta_i = \alpha\theta_{i-1} + q_i$ with $0 \leq |\alpha| < 1$. In other words,
$w_i^o$ undergoes random variations around its mean $w^o$, with the perturbations $\theta_i$ being generated by
a first-order auto-regressive model with a pole at $\alpha$ and a random initial condition denoted by $\theta_{-1}$.

The purpose of Probs. IV.29 and IV.30 is to extend the variance analysis of Secs. 20.3 and 20.4
to such more general non-stationary models. Subsequent problems then examine the performance of
several adaptive filters under these conditions. So consider adaptive filters of the form (20.21) and
define the error quantities:


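$$ \tilde{w}_i \triangleq w_i^oe^{j\Omega i} - w_i, \qquad e_p(i) \triangleq u_i\left[w_i^oe^{j\Omega i} - w_i\right], \qquad e_a(i) \triangleq u_i\left[w_i^oe^{j\Omega i} - w_{i-1}\right] $$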


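(a) Establish the relations

$$ e_p(i) = e_a(i) - \mu\|u_i\|^2g[e(i)], \qquad \tilde{w}_i = \tilde{w}_{i-1} - \mu u_i^*g[e(i)] + c_ie^{j\Omega(i-1)} $$

where $c_i = w^o(e^{j\Omega} - 1) + \theta_{i-1}(\alpha e^{j\Omega} - 1) + q_ie^{j\Omega}$.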
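(b) Establish the energy-conservation relation

$$ \|\tilde{w}_{i-1} + c_ie^{j\Omega(i-1)}\|^2 + \bar{\mu}(i)|e_p(i)|^2 = \|\tilde{w}_i\|^2 + \bar{\mu}(i)|e_a(i)|^2 $$

where $\bar{\mu}(i) = 1/\|u_i\|^2$ if $u_i \neq 0$ and $\bar{\mu}(i) = 0$ otherwise.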
(c) Establish also the relations

$$ w_i^o = w_{i-1}^o + \theta_i - \theta_{i-1} $$

$$ e_a(i) + v(i) = u_i\tilde{w}_{i-1} + u_i\left(w_i^oe^{j\Omega} - w_{i-1}^o\right)e^{j\Omega(i-1)} + v(i) $$


(d) Show that the model of this problem encompasses the random-walk model (20.10) as a special 
case. 


Remark. In this problem we adopted a model with a constant frequency offset $\Omega$. The term $e^{j\Omega i}$ in the adopted
model could be used, for example, to model Doppler channel variations in a wireless scenario, which result from
reflections of the transmitted signal off a remote object moving with constant speed (e.g., Ghosh (1998)). Actually,
many digital communication standards use the ability of digital communication receivers to track such
Doppler shifts as a performance index for their ability to track time variant channels. A more general model
would be to consider a time variant term of the form $e^{j(\Omega i + \phi(i))}$ in order to account for both frequency and
phase offsets.


Remark. The results of Probs. IV.29-IV.33 are based on the work by Yousef and Sayed (2002). 


Problem IV.30 (Variance relation) Consider the same setting of Prob. IV.29 and assume that $q_i$
is an i.i.d. sequence with covariance matrix $Q$ and independent of the initial conditions $\{w_{-1}, \theta_{-1}\}$,
of the regressors $\{u_j\}$ for all $j$, and of the $\{d(j)\}$ for all $j < i$. Assume further that the filter is
operating in steady-state, i.e., $E\,\|\tilde{w}_i\|^2 = E\,\|\tilde{w}_{i-1}\|^2$ as $i \to \infty$. Take expectations of both sides of
the energy-conservation relation of part (b) in Prob. IV.29 and show that in the limit, as $i \to \infty$, it
leads to the following variance relation:


2Re(Eez()g[eG)) = wE (lul? Igle) + n7! TQ) 
tun — 27 Pll]? + wo" [1 - ae)? THO) 


- 24 Re (1 = e JPyw^*E (iy E uui gie(i)])e-7 67») 
— Qu Re |(1 — o*e7??)E (62-1 (i-1 — pur gle(i)]e-967»)] 


where $\Theta = \lim_{i\to\infty} E\,\theta_i\theta_i^* = Q/(1 - |\alpha|^2)$.

Remark. Observe that when $\alpha = 1$ and $\Omega = 0$, the
last three terms on the right-hand side of the above equality disappear and the relation collapses to the variance
relation (20.32).


Problem IV.31 (Mean-weight error by LMS with frequency offset) Here and in Probs. IV.32-IV.33,
we use the results of Probs. IV.29-IV.30 to characterize the tracking performance of LMS for
the more general nonstationary model of Prob. IV.29. Thus consider the LMS recursion
$w_i = w_{i-1} + \mu u_i^*e(i)$ with $e(i) = d(i) - u_iw_{i-1}$, and assume that (21.12) holds, namely,


At steady state, $\tilde{w}_{i-1}$ is independent of $u_i$


Assume further that the data $\{d(i), u_i\}$ are such that:


(a) There exists a vector $w_i^o$ such that $d(i) = u_iw_i^oe^{j\Omega i} + v(i)$.

(b) The weight vector varies according to $w_i^o = w^o + \theta_i$.

(c) The perturbation varies according to $\theta_i = \alpha\theta_{i-1} + q_i$.

(d) The noise sequence $\{v(i)\}$ is i.i.d. with variance $\sigma_v^2 = E\,|v(i)|^2$.

(e) The noise sequence $v(i)$ is independent of $u_j$ for all $i, j$.

(f) The sequence $q_i$ has covariance $Q$ and is independent of $\{v(j), u_j\}$ for all $i, j$.

(g) The initial conditions $\{w_{-1}, \theta_{-1}\}$ are independent of all $\{d(j), u_j, v(j), q_j\}$.

(h) The regressor covariance matrix is denoted by $R_u = E\,u_i^*u_i > 0$.

(i) The coefficient $\alpha$ satisfies $|\alpha| < 1$.


In order to apply the variance relation of Prob. IV.30, we need to evaluate the expectations $E\,\tilde{w}_{i-1}$
and $E\,\theta_{i-1}^*\tilde{w}_{i-1}$. The purpose of this problem is to show that, in steady-state, $E\,\tilde{w}_i$ takes the form

$$ E\,\tilde{w}_i = ve^{j\Omega i}, \quad i \to \infty $$

for some constant vector $v$ to be determined.

(a) Let $v_i = E\,\tilde{w}_i$. Show that $v_i$ satisfies $v_i = (I - \mu R_u)v_{i-1} + (I - \mu R_u)w^o(e^{j\Omega} - 1)e^{j\Omega(i-1)}$
as $i \to \infty$.


(b) Introduce the eigenvalue decomposition of $R_u$, say $R_u = U\Lambda U^*$, where $U$ is a unitary matrix
and $\Lambda$ is a diagonal matrix with positive entries $\{\lambda_1, \lambda_2, \ldots, \lambda_M\}$. Let $v_i'(k)$ denote the $k$-th
entry of the transformed vector $v_i' = U^*v_i$. Likewise, let $c'(k)$ denote the $k$-th entry of the
transformed vector $c' = U^*w^o(e^{j\Omega} - 1)$. Show that $v_i'(k)$ satisfies the steady-state recursion

$$ v_i'(k) = (1 - \mu\lambda_k)v_{i-1}'(k) + (1 - \mu\lambda_k)c'(k)e^{j\Omega(i-1)}, \quad k = 1, \ldots, M, \quad i \to \infty $$

(c) Assume $\mu$ satisfies $\mu < 2/\lambda_{\max}$, where $\lambda_{\max}$ denotes the maximum eigenvalue of $R_u$. Argue
from the recursion in part (b) that, in steady-state, $v_i'(k)$ tends to $v_i'(k) = be^{j\Omega i}$, for some
constant $b$.

(d) Conclude that, in steady-state, the vector $v_i = E\,\tilde{w}_i$ tends to the form $v_i = ve^{j\Omega i}$, where

$$ v \triangleq \left[I - e^{j\Omega}(I - \mu R_u)^{-1}\right]^{-1}w^o(1 - e^{j\Omega}) $$


Problem IV.32 (Cross-correlation by LMS with frequency offset) Consider the setting of Prob.
IV.31. We now verify that, in steady-state, the matrix $E\,\tilde{w}_i\theta_i^*$ takes the form

$$ E\,\tilde{w}_i\theta_i^* = We^{j\Omega i}, \quad i \to \infty $$

for some constant matrix $W$ to be determined.

(a) Let $W_i = E\,\tilde{w}_i\theta_i^*$. Show that $W_i$ satisfies $W_i = \alpha^*(I - \mu R_u)W_{i-1} - (I - \mu R_u)Ce^{j\Omega(i-1)}$
as $i \to \infty$, where $C = \alpha^*(1 - \alpha e^{j\Omega})\Theta - e^{j\Omega}Q$.


(b) Define the transformed matrices $W_i' = U^*W_i$ and $C' = U^*C$. Use arguments similar to
those in parts (c) and (d) of Prob. IV.31 to show that each element of $W_i'$ converges to a
constant times the exponential sequence $e^{j\Omega i}$.


(c) Conclude that $W_i$ tends to $We^{j\Omega i}$, where $W = \left[\alpha^*I - e^{j\Omega}(I - \mu R_u)^{-1}\right]^{-1}C$.


Problem IV.33 (Tracking by LMS with frequency offset) We use the results of Probs. IV.31
and IV.32 to characterize the tracking performance of LMS for the nonstationary model of Prob. IV.29.
Use $g[e(i)] = e_a(i) + v(i)$ and substitute the expressions for $E\,\tilde{w}_i$ and $E\,\tilde{w}_i\theta_i^*$ into the variance
relation of Prob. IV.30. Then proceed to address the following questions.


(a) Assume the step-size is such that the term $E\,\|u_i\|^2|e_a(i)|^2$ can be neglected. Show that the
resulting EMSE is given by

$$ \zeta^{\rm LMS} = \frac{\mu\sigma_v^2\,\mathrm{Tr}(R_u) + \mu^{-1}\beta}{2} \qquad \text{(small } \mu\text{)} $$


8 = |L- e?) ReTr[(I- 2(X — pRu))W"| 


where 


+ |1 — ae? Re Tr [(I — 2(a" Xa + uR,))8] + Re Tt [a + 2(g? — a*)Xa)Q] 


with X = (I- wR.) [I - e?(1— pRa) ], Xa = (1 - uRu) [o1- e?(1- pRu)*]7 


and W° = w°w™. 


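(b) Use instead the separation principle (21.8) to show that the EMSE evaluates to

$$ \zeta^{\rm LMS} = \frac{\mu\sigma_v^2\,\mathrm{Tr}(R_u) + \mu^{-1}\beta}{2 - \mu\,\mathrm{Tr}(R_u)} \qquad \text{(over wider range of } \mu\text{)} $$

(c) Assume Gaussian regressors with $R_u = \sigma_u^2I$. Show that, in this case, the EMSE evaluates to

$$ \zeta^{\rm LMS} = \frac{\mu M\sigma_v^2\sigma_u^2 + \mu^{-1}\beta}{2 - \mu(M + \gamma)\sigma_u^2} \qquad \text{(Gaussian)} $$

where $M$ is the filter length, $\gamma = 1$ if the $\{u_i\}$ are circular complex-valued and $\gamma = 2$ if the
$\{u_i\}$ are real-valued. Moreover, $\beta$ is as in part (a) with $R_u$ replaced by $\sigma_u^2I$.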


Problem IV.34 (Tracking performance of CMA2-2 for real-data) In this problem we consider
the CMA2-2 recursion of Prob. IV.16, namely,

$$ w_i = w_{i-1} + \mu u_i^{\sf T}z(i)\left[\gamma - z^2(i)\right], \qquad z(i) = u_iw_{i-1}, \quad i \geq 0 $$

and study its tracking performance in nonstationary environments. Motivated by the discussion in
Prob. IV.16, we assume that there exists an optimal equalizer $w_i^o$ that varies according to the rule
$w_i^o = w_{i-1}^o + q_i$, where $q_i$ is i.i.d. with covariance matrix $Q$. Moreover, $q_i$ is assumed to be
independent of all other random variables. The model $w_i^o$ is such that it reproduces the complex data
$\{s(i)\}$ up to some delay $\Delta$, i.e., $u_iw_i^o = s(i - \Delta)$. The tracking performance of the algorithm is
measured in terms of the steady-state value of $E\,|s(i - \Delta) - u_iw_{i-1}|^2$.


(a) Verify that the energy-conservation relation of Thm. 20.1 still holds for CMA2-2.


(b) Verify also that the variance relation of Thm. 20.2 still holds, namely, for any data $\{u_i\}$,

$$ \mu E\,\|u_i\|^2g^2[z(i)] + \mu^{-1}\mathrm{Tr}(Q) = 2E\,e_a(i)g[z(i)], \quad \text{as } i \to \infty $$

(c) Using $g[z] = z[\gamma - z^2]$, and the assumptions from Prob. IV.16, show that the MSE of CMA2-2
is given by

$$ \mathrm{MSE}^{\rm CMA2-2} = \frac{\mu E\,(\gamma^2s^2 - 2\gamma s^4 + s^6)\,\mathrm{Tr}(R_u) + \mu^{-1}\mathrm{Tr}(Q)}{2E\,(3s^2 - \gamma)} $$

in terms of the second, fourth, and sixth moments of the constellation.


(d) Conclude that, when the denominator below is nonzero, an optimal choice for the step-size
that minimizes the MSE is given by

$$ \mu_{\rm opt}^{\rm CMA2-2} = \sqrt{\frac{\mathrm{Tr}(Q)}{E\,(\gamma^2s^2 - 2\gamma s^4 + s^6)\,\mathrm{Tr}(R_u)}} $$

with the resulting minimum MSE

$$ \mathrm{MSE}_{\min}^{\rm CMA2-2} = \frac{\sqrt{\mathrm{Tr}(Q)\,E\,(\gamma^2s^2 - 2\gamma s^4 + s^6)\,\mathrm{Tr}(R_u)}}{E\,(3s^2 - \gamma)} $$

Remark. The results of this problem, and of Probs. IV.35 and IV.36, are based on the work by Yousef and Sayed 
(1999b,2001). 


Problem IV.35 (Tracking performance of CMA2-2 for complex-data) In this problem we extend
the results of Prob. IV.34 to complex-valued constellations. Thus consider the CMA2-2 recursion,

$$ w_i = w_{i-1} + \mu u_i^*z(i)\left[\gamma - |z(i)|^2\right], \qquad z(i) = u_iw_{i-1}, \quad i \geq 0 $$

and assume there exists an optimal equalizer $w_i^o$ that varies according to the rule $w_i^o = w_{i-1}^o + q_i$,
where $q_i$ is i.i.d. with covariance matrix $Q$. Moreover, $q_i$ is assumed to be independent of all other
random variables. The model $w_i^o$ is such that it reproduces the complex data $\{s(i)\}$ up to some
rotation $\theta$ and delay $\Delta$, i.e., $u_iw_i^o = s(i - \Delta)e^{j\theta}$.


(a) Verify that the variance relation of Thm. 20.2 still holds, namely, for any data $\{u_i\}$,

$$ \mu E\,\|u_i\|^2|g[z(i)]|^2 + \mu^{-1}\mathrm{Tr}(Q) = 2\,\mathrm{Re}\left(E\,e_a^*(i)g[z(i)]\right), \quad \text{as } i \to \infty $$

(b) Now introduce the same assumptions as in part (b) of Prob. IV.17 and show that the MSE of
CMA2-2 is given by

$$ \mathrm{MSE}^{\rm CMA2-2} = \frac{\mu E\,(\gamma^2|s|^2 - 2\gamma|s|^4 + |s|^6)\,\mathrm{Tr}(R_u) + \mu^{-1}\mathrm{Tr}(Q)}{2E\,(2|s|^2 - \gamma)} $$

in terms of the second, fourth, and sixth moments of the constellation.


(c) Assume the data $\{s(i)\}$ are constant-modulus, $|s(i)| = 1$, and choose $\gamma = 1$. What would
the resulting MSE be?


(d) Show that, when the denominator below is nonzero, an optimal choice for the step-size that
minimizes the MSE is given by

$$ \mu_{\rm opt}^{\rm CMA2-2} = \sqrt{\frac{\mathrm{Tr}(Q)}{E\,(\gamma^2|s|^2 - 2\gamma|s|^4 + |s|^6)\,\mathrm{Tr}(R_u)}} $$

with the resulting minimum MSE

$$ \mathrm{MSE}_{\min}^{\rm CMA2-2} = \frac{\sqrt{\mathrm{Tr}(Q)\,E\,(\gamma^2|s|^2 - 2\gamma|s|^4 + |s|^6)\,\mathrm{Tr}(R_u)}}{E\,(2|s|^2 - \gamma)} $$


Problem IV.36 (Tracking performance of CMA1-2) Consider the CMA1-2 recursion of Prob. IV.18,
namely,

$$ w_i = w_{i-1} + \mu u_i^*\left[\gamma\frac{z(i)}{|z(i)|} - z(i)\right], \qquad z(i) = u_iw_{i-1}, \quad i \geq 0 $$

and let us study its tracking performance in nonstationary environments for complex data. We again
assume that there exists an optimal equalizer model $w_i^o$ that varies according to the rule
$w_i^o = w_{i-1}^o + q_i$, where $q_i$ is i.i.d. with covariance matrix $Q$. Moreover, $q_i$ is assumed to be independent
of all other random variables. The model $w_i^o$ is such that it reproduces the complex data $\{s(i)\}$ up
to some rotation $\theta$ and delay $\Delta$, i.e., $u_iw_i^o = s(i - \Delta)e^{j\theta}$.

(a) Justify the variance relation $\mu E\,\|u_i\|^2|g[z(i)]|^2 + \mu^{-1}\mathrm{Tr}(Q) = 2\,\mathrm{Re}\left(E\,e_a^*(i)g[z(i)]\right)$ as $i \to \infty$.
Now consider the same assumptions imposed in Prob. IV.18 and show that this relation
leads to the following expression for the MSE:

$$ \mathrm{MSE}^{\rm CMA1-2} = \frac{\mu}{2}\left(\gamma^2 + E\,|s|^2 - 2\gamma E\,|s|\right)\mathrm{Tr}(R_u) + \frac{\mu^{-1}}{2}\mathrm{Tr}(Q) $$

(b) Conclude that, when the denominator below is nonzero, an optimal choice for the step-size
that minimizes the MSE is given by

$$ \mu_{\rm opt}^{\rm CMA1-2} = \sqrt{\frac{\mathrm{Tr}(Q)}{E\,(\gamma^2 + |s|^2 - 2\gamma|s|)\,\mathrm{Tr}(R_u)}} $$

and the corresponding minimum MSE is

$$ \mathrm{MSE}_{\min}^{\rm CMA1-2} = \sqrt{\mathrm{Tr}(Q)\,E\,(\gamma^2 + |s|^2 - 2\gamma|s|)\,\mathrm{Tr}(R_u)} $$


Problem IV.37 (Feedback mapping) Refer to the discussion in App. 15.A on a system-theoretic 
interpretation for the energy-conservation relation. How would you modify Fig. 15.5 in order to 
represent the energy-conservation relation (20.24) for nonstationary models? 


Problem IV.38 (Fixed-point arithmetic) Let $b_i \in \{0, 1\}$. Consider a fixed-point representation
of the form $\pm 0.b_1b_2\ldots b_{B-1}$. This representation is limited to numbers that are less than one in
magnitude. Therefore, additions and subtractions may cause overflow if the result lies outside this
range, while the multiplication of two fixed-point numbers never causes overflow. Let $\epsilon$ denote the
machine precision, i.e., the largest absolute difference between a real number $a$ and its fixed-point
representation, so that $|\mathrm{fx}[a] - a| \leq \epsilon$ for any $a$. Assume all variables are represented with $B$ bits
(including the sign bit) and that rounding is used so that $\epsilon = 2^{-B}$.


(a) Consider a random variable $a$ with distribution

$$ a = \begin{cases} 0.5 + 2^{-7} & \text{with probability } 0.5 - 2^{-7} \\ -0.5 + 2^{-7} & \text{with probability } 0.5 + 2^{-7} \end{cases} $$

Show that $\mathrm{fx}[a]$ will have the distribution

$$ \mathrm{fx}[a] = \begin{cases} 0.5 + 2^{-6} & \text{with probability } 0.5 - 2^{-7} \\ -0.5 & \text{with probability } 0.5 + 2^{-7} \end{cases} $$

and conclude that $E\,\mathrm{fx}[a] = -2^{-13}$.
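
The effect can be reproduced with exact rational arithmetic. The sketch below is built on assumptions read off the stated distributions rather than on a definitive quantizer model: the offsets are taken as $2^{-7}$, the quantization step as $2^{-6}$, and ties are rounded away from zero; the function name `fx` simply mirrors the notation of the problem.

```python
from fractions import Fraction
import math

x = Fraction(1, 2 ** 7)                     # offset 2^-7 (assumed)
step = Fraction(1, 2 ** 6)                  # quantization step 2^-6 (assumed)

# Distribution of a: {value: probability}.
a_dist = {
    Fraction(1, 2) + x: Fraction(1, 2) - x,
    Fraction(-1, 2) + x: Fraction(1, 2) + x,
}

def fx(a):
    """Round a to the nearest multiple of `step`, ties rounded away from zero."""
    sign = 1 if a >= 0 else -1
    n = math.floor(abs(a) / step + Fraction(1, 2))
    return sign * n * step

mean_a = sum(p * a for a, p in a_dist.items())
mean_fx = sum(p * fx(a) for a, p in a_dist.items())
print(mean_a, mean_fx)
```

Both outcomes of $a$ sit exactly halfway between two grid points, so the rounding pushes each of them outward, and the unequal probabilities leave a small residual mean.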


Remark. The result of this problem shows that a zero-mean random variable may become non-zero mean after 
quantization. The small nonzero mean might cause a slow drift of the LMS weight estimate, causing the algorithm 
to overflow. 


Problem IV.39 (Drift problem) All variables in this problem are real-valued. Consider a quantized
LMS implementation in a stationary environment with two taps, which according to the discussion
in Sayed (2003, Ch. 8) takes the form

$$ w_i^q = w_{i-1}^q + \mu u_i^{\sf T}\left[e_a(i) + v(i)\right] - p_i $$

where $w_i^q$ denotes the quantized weight vector, $e_a(i) = u_i\tilde{w}_{i-1}^q$, $\tilde{w}_i^q = w^o - w_i^q$, and $v(i)$ and $p_i$
are assumed to be i.i.d. and independent of all other random variables with variances denoted by $\sigma_v^2$
and $R_p$, respectively. Assume $u_i = [\,u(i)\;\;0\,]$ where $u(i)$ is an i.i.d. sequence with variance
$\sigma_u^2$ and independent of all other variables as well. Therefore, in this example, the covariance matrix
$R_u$ is singular and given by $R_u = \mathrm{diag}\{\sigma_u^2, 0\}$. Let $C_i$ denote the covariance matrix of $\tilde{w}_i^q$, and
partition $C_i$ and $R_p = E\,p_ip_i^{\sf T}$ (assumed diagonal) into

$$ C_i = \begin{bmatrix} c_1(i) & c_2(i) \\ c_2(i) & c_3(i) \end{bmatrix}, \qquad R_p = \begin{bmatrix} a & 0 \\ 0 & b \end{bmatrix} $$

(a) Show that $c_1(i)$ satisfies $c_1(i) = c_1(i-1)\,E\,(1 - \mu u^2(i))^2 + \mu^2\sigma_u^2\sigma_v^2 + a$.


(b) Conclude further that $c_3(i)$ satisfies $c_3(i) = c_3(i-1) + b$ with a forced term $b$ and that,
therefore, it grows unbounded.

Remark. This example shows how the lack of sufficient excitation in the regression data, as determined by 
the singularity of Ru, may result in weight estimates growing unbounded and overflowing; thus leading 
to the drift problem of LMS. As explained below in part (c), one way to address the drift problem is to 
use leakage at the cost of introducing bias — see Probs. V.27-V.32. 


(c) Consider now a quantized implementation of leaky-LMS, which according to the discussion
in Sayed (2003, Ch. 8) can be modeled as:

$$ w_i^q = (1 - \mu\alpha)w_{i-1}^q + \mu u_i^{\sf T}\left[e_a(i) + v(i)\right] - p_i $$

Show that $\{c_1(i), c_3(i)\}$ now satisfy the recursions:

$$ c_1(i) = c_1(i-1)\,E\,(1 - \mu\alpha - \mu u^2(i))^2 + \mu^2\sigma_u^2\sigma_v^2 + a, \qquad c_3(i) = (1 - \mu\alpha)^2c_3(i-1) + b $$

Conclude that the drift problem does not occur any longer for sufficiently small step-sizes.
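
The contrast between the two recursions for the unexcited tap can be seen by iterating them directly. This is a minimal sketch, assuming the recursions $c_3(i) = c_3(i-1) + b$ (plain LMS) and $c_3(i) = (1-\mu\alpha)^2c_3(i-1) + b$ (leaky-LMS) with zero initial condition; the numerical values of `b`, `mu`, and `alpha` are hypothetical and chosen only for illustration.

```python
import math

def iterate_c3(b, mu, alpha, n):
    """Iterate the variance recursions of the unexcited tap for n steps."""
    c3_plain = 0.0   # quantized LMS: c3(i) = c3(i-1) + b
    c3_leaky = 0.0   # quantized leaky-LMS: c3(i) = (1 - mu*alpha)^2 c3(i-1) + b
    shrink = (1.0 - mu * alpha) ** 2
    for _ in range(n):
        c3_plain = c3_plain + b
        c3_leaky = shrink * c3_leaky + b
    return c3_plain, c3_leaky

b, mu, alpha, n = 1e-8, 0.01, 0.1, 100000    # hypothetical values
c3_plain, c3_leaky = iterate_c3(b, mu, alpha, n)

# Fixed point of the leaky recursion: c3 = b / (1 - (1 - mu*alpha)^2).
leaky_limit = b / (1.0 - (1.0 - mu * alpha) ** 2)
print(c3_plain, c3_leaky, leaky_limit)
```

The plain recursion grows linearly as $n\,b$, while the leaky one settles at a finite fixed point whenever $0 < \mu\alpha < 2$.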


Problem IV.40 (Singular covariance matrix) The result of Prob. IV.39 reveals that a singular
covariance matrix can cause LMS to drift in a finite-precision implementation. The case of singular or
close-to-singular $R_u$ is not an abstraction. It arises in some applications, e.g., in fractionally-spaced
channel equalization (which is discussed in App. 3.B of Sayed (2003)). Figure IV.2 depicts the structure
of an adaptive fractionally-spaced equalizer. The received signal is denoted by $u(t)$ and is sampled
at the rate of $1/T'$ samples/second. This sampling rate is chosen to be double the symbol rate,
$1/T$. The regressor of the equalizer is therefore $u_i = [\,u(i)\;\;u(i-1)\;\;\ldots\;\;u(i-M+1)\,]$,
where $u(i)$ is obtained by sampling $u(t)$ at the rate $1/T' = 2/T$, i.e., $u(i) = u(t)|_{t=iT'}$.


FIGURE IV.2 A structure for adaptive fractionally-spaced equalization with training and decision-directed
modes of operation. The update of the weight vector of the equalizer is performed at the symbol rate.


The random process $u(t)$ is assumed to be stationary and its power spectrum is defined by
$S_u(j\Omega) = \int_{-\infty}^{\infty} R_u(\tau)e^{-j\Omega\tau}d\tau$, where $j = \sqrt{-1}$ and $R_u(\tau)$ is the auto-correlation function of
the random process $u(t)$, i.e., $R_u(\tau) = E\,u(t)u^*(t - \tau)$. Let $R_u = E\,u_i^*u_i$ denote the covariance
matrix of $u_i$. The purpose of this problem is to show that there are power spectra $S_u(j\Omega)$ for which
the matrix $R_u$ can be singular. To do so, we shall show that there exists a nonzero vector $x$ such that
$x^*R_ux = 0$.


(a) Let $x$ denote an arbitrary column vector with entries $\{x(1), \ldots, x(M)\}$. Use the inverse
Fourier transform to show that

$$ x^*R_ux = \sum_{m=1}^{M}\sum_{n=1}^{M} x^*(m)x(n)R_u\big((m-n)T'\big) = \frac{1}{2\pi}\int_{-\infty}^{\infty} S_u(j\Omega)\,|X(j\Omega)|^2\,d\Omega $$

where $X(j\Omega) \triangleq \sum_{n=1}^{M} x(n)e^{-j\Omega nT'}$. Observe that $X(j\Omega)$ is periodic with period
$\Omega_s = 2\pi/T' = 4\pi/T$. Conclude that $R_u$ is singular if, and only if, a vector $x$ exists such that

$$ \int_{-\infty}^{\infty} S_u(j\Omega)\,|X(j\Omega)|^2\,d\Omega = 0 $$


(b) Assume Su (jQ) has bandwidth 2, = 27/T and is given by 


1 O< 
S(j9)-24 -1 LA 
0 otherwise 


Assume also that z is chosen such that | X (j9)|? is symmetric around Ns. Conclude that the 
resulting covariance matrix Ru will be singular. 


Remark. This problem is a modified version of an example from Gitlin, Meadors, and Weinstein (1982). 


COMPUTER PROJECTS 


Project IV.1 (Line echo cancellation) In communications over phone lines, a signal travelling
from a far-end point to a near-end point is usually reflected at the near-end due to mismatches in
circuitry (e.g., hybrid connections). The reflected signal travels back to the far-end point in the form
of an echo. As a result, the speaker at the far-end receives, in addition to the desired signal from the
near-end speaker, an attenuated replica of his own signal in the form of an echo (see Fig. IV.3).
The echo interferes with the quality of the received signal. A common way to provide better voice
quality at both ends is to employ adaptive line echo cancellers (LEC). At the near-end, for example,
the signal feeding the LEC is the far-end signal while the reference signal is its reflected version
(see Fig. IV.4). In the figure, the output of the adaptive LEC generates a replica of the echo, and the
error signal is therefore a “clean” signal that is transmitted to the far-end. The signals in this project
are assumed to be sampled at 8 kHz.


FIGURE IV.3 The signal at the far-end is reflected at the near-end due to circuit mismatches and travels back
to the far-end.


FIGURE IV.4 An adaptive line echo canceller at the near-end.


(a) Load the file path.mat, which contains the impulse response sequence of a typical echo path. 
Plot the impulse and frequency responses of the echo path. 


(b) Load the file css.mat, which contains 5600 samples of a composite source signal; it is a 
synthetic signal that emulates the properties of speech. Specifically, it contains segments of 
pause, segments of periodic excitation and segments with white-noise properties. Plot the 
samples of the CSS data, as well as their spectrum. 


(c) Concatenate five such blocks and feed them into the echo path. Plot the resulting echo signal.
Estimate the input and output powers in dB using

$$ P = 10\log_{10}\left(\frac{1}{N}\sum_{i=1}^{N}|\mathrm{signal}(i)|^2\right) $$

where $N$ denotes the length of the sequence. Evaluate the attenuation in dB that is introduced
by the echo path as the signal travels through it; this attenuation is called the echo-return-loss
(ERL).
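
The power estimate and the resulting ERL can be sketched as follows. The function names `power_db` and `echo_return_loss` are illustrative, and the short test sequence stands in for the CSS data and the echo path, which are not reproduced here.

```python
import math

def power_db(signal):
    """Average power of a sequence in dB: 10 log10((1/N) sum |signal(i)|^2)."""
    n = len(signal)
    return 10.0 * math.log10(sum(abs(s) ** 2 for s in signal) / n)

def echo_return_loss(far_end, echo):
    """Attenuation (in dB) from the far-end signal to its echo."""
    return power_db(far_end) - power_db(echo)

# Toy stand-in data: an echo that is the input scaled by 0.1 (a 20 dB loss).
far_end = [1.0, -1.0, 1.0, -1.0]
echo = [0.1 * s for s in far_end]

erl = echo_return_loss(far_end, echo)
print(power_db(far_end), erl)
```

The same two functions apply unchanged to the error signal of part (e), where the difference between the echo power and the error power gives the ERLE.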


(d) Use 10 blocks of CSS data as far-end signal, and the corresponding output of the echo path as
the echo signal. Choose an adaptive line echo canceller with 128 taps. Train the canceller by
using as input data the far-end signal, i.e., $u(i) = \text{far-end}(i)$, and as reference data the echo
signal, i.e., $d(i) = \mathrm{echo}(i)$. Use ε-NLMS with ε = 107 and $\mu = 0.25$. Plot the far-end
signal, the echo, and the error signal provided by the adaptive filter. Plot also the echo path
and its estimate by the adaptive filter at the end of the simulation.
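
The training step can be sketched with a small real-valued ε-NLMS loop. Everything specific in this sketch is an assumption made for illustration: the echo path `h` is a made-up four-tap response rather than the contents of path.mat, the input is white Gaussian noise rather than CSS data, the filter has 8 taps instead of 128, and the regularization `eps = 1e-6` is a placeholder value.

```python
import random

def apply_echo_path(u, h):
    """Echo d(i) = sum_k h[k] u(i-k), with zero initial conditions."""
    return [sum(h[k] * u[i - k] for k in range(len(h)) if i - k >= 0)
            for i in range(len(u))]

def eps_nlms(u, d, M, mu=0.25, eps=1e-6):
    """Real-valued eps-NLMS: w <- w + mu * e * u_i / (eps + ||u_i||^2)."""
    w = [0.0] * M
    reg = [0.0] * M                      # regressor [u(i), u(i-1), ..., u(i-M+1)]
    errors = []
    for ui, di in zip(u, d):
        reg = [ui] + reg[:-1]            # shift the new sample into the regressor
        e = di - sum(wk * rk for wk, rk in zip(w, reg))
        norm = eps + sum(r * r for r in reg)
        w = [wk + mu * e * rk / norm for wk, rk in zip(w, reg)]
        errors.append(e)
    return w, errors

random.seed(1)
h = [0.5, -0.3, 0.2, 0.1]                        # hypothetical echo path
far_end = [random.gauss(0.0, 1.0) for _ in range(2000)]
echo = apply_echo_path(far_end, h)

M = 8
w, errors = eps_nlms(far_end, echo, M)
h_padded = h + [0.0] * (M - len(h))
max_tap_error = max(abs(a - b) for a, b in zip(w, h_padded))
print(max_tap_error)
```

In this noise-free toy setting the weight estimate approaches the padded echo path; with measurement noise, as in part (f), the error settles at a nonzero steady-state level instead.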


(e) Estimate the steady-state power of the error signal and measure its attenuation in dB relative
to the echo signal. Use the last 5600 samples of the signals to estimate their powers. The difference in
power is a measure of the attenuation introduced by the LEC and it is called the echo-return-loss-enhancement
(ERLE).


(f) Fix the input power at 0 dB and add white Gaussian noise with variance $\sigma_v^2 = 0.0001$ to the
echo signal. Train the LEC using 80 blocks of CSS data and measure the steady-state ERLE.
Compare the simulated and theoretical ERLEs.


Project IV.2 (Tracking a Rayleigh fading channel) In a wireless communications environment,
signals suffer from multiple reflections while travelling from the transmitter to the receiver so that
the receiver ends up getting several (almost simultaneous) replicas of the transmitted signal. The
reflections are received with different amplitude and phase distortions, and the overall received signal
is the combined sum of the reflections. Based on the relative phases of the reflections, the signals
may add up constructively or destructively at the receiver. Furthermore, if the transmitter is moving
with respect to the receiver, these destructive and constructive interferences will vary with time. This
phenomenon is known as channel fading.

The impulse response of a single-path (i.e., single-tap) fading channel can be described as

$$ h(n) = \gamma\,x(n)\,\delta(n - n_o) $$

where $\{x(n)\}$ is a time-variant complex sequence that models the time-variations in the channel,
and $n_o$ is the channel delay. Without loss of generality, the sequence $\{x(n)\}$ is assumed to have unit
variance, and the scalar $\gamma$ is used to model the actual path loss that is introduced by the channel. That
is, $\gamma^2$ is equal to the power attenuation that a signal will undergo when it travels through the channel.

Several mathematical models can be used to characterize the fading properties of $\{x(n)\}$, and
consequently of the channel. A widely used model is known as Rayleigh fading. In this case, for
each $n$, the amplitude $|x(n)|$ is assumed to have a Rayleigh distribution (cf. (A.3)), i.e.,

$$ f_{|x(n)|}(|x(n)|) = |x(n)|\,e^{-|x(n)|^2/2}, \qquad |x(n)| \geq 0 $$

while the phase $\angle x(n)$ is assumed to be uniformly distributed within $[-\pi, \pi]$:

$$ f_{\angle x(n)}(\angle x(n)) = \frac{1}{2\pi}, \qquad -\pi \leq \angle x(n) \leq \pi $$

In addition, the auto-correlation function of the sequence $\{x(n)\}$, now regarded as a random process,
is modeled as a zeroth-order Bessel function of the first kind, namely,

$$ r(k) \triangleq E\,x(n)x^*(n-k) = J_0(2\pi f_DT_sk), \qquad k = \ldots, -1, 0, 1, \ldots $$

where $T_s$ is the sampling period of the sequence $\{x(n)\}$, $f_D$ is called the maximum Doppler
frequency of the Rayleigh fading channel, and the function $J_0(\cdot)$ is defined by

$$ J_0(y) \triangleq \frac{1}{\pi}\int_0^{\pi}\cos(y\sin\theta)\,d\theta $$

The Doppler frequency $f_D$ is related to the speed of the mobile user, $v$, and to the carrier frequency,
$f_c$, as follows:

$$ f_D = \frac{v\,f_c}{c} $$

where $c$ denotes the speed of light, $c = 3 \times 10^8$ m/s. This commonly used choice of the
auto-correlation function is based on the assumption that all scatterers are uniformly distributed on a circle
around the receiver, so that the power spectrum of the channel fading gain $x(t)$, in continuous-time,
would have the following well-known U-shaped spectrum,

$$ S(f) = \frac{1}{\pi f_D\sqrt{1 - (f/f_D)^2}}, \qquad |f| \leq f_D $$

Other assumptions on the distribution of the scatterers would lead to different auto-correlation functions.

(a) Assume a carrier frequency of $f_c = 900$ MHz. Verify that the Doppler frequency that corresponds
to a vehicle moving at the speed of $v = 80$ km/h is $f_D = 66.67$ Hz.
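
The arithmetic behind this value follows from $f_D = v f_c/c$ after converting the speed to m/s. A minimal sketch, with the function name `doppler_frequency` chosen here for illustration:

```python
def doppler_frequency(speed_kmh, carrier_hz, c=3e8):
    """Maximum Doppler frequency f_D = v * f_c / c, with the speed given in km/h."""
    v = speed_kmh * 1000.0 / 3600.0      # convert km/h to m/s
    return v * carrier_hz / c

f_d = doppler_frequency(80.0, 900e6)     # 80 km/h at a 900 MHz carrier
print(round(f_d, 2))
```

At 80 km/h the speed is about 22.2 m/s, so the ratio $v/c$ is roughly $7.4 \times 10^{-8}$, which scales the 900 MHz carrier down to a Doppler frequency of a few tens of Hz.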


(b) Use the program rayleigh.m to generate 2000 samples of a Rayleigh fading coefficient $x(n)$
with Doppler frequency $f_D = 66.67$ Hz and sampling period $T_s = 1$ ms. Plot its amplitude
and phase. Plot also the cumulative distribution function (cdf) of $|x(n)|$ (by definition, for
any particular amplitude value $x$, the cdf shows the proportion of occurrences smaller than $x$).
Plot also the amplitude of a Rayleigh fading sequence with the same Doppler frequency but
with sampling period $T_s = 1\ \mu$s. What do you observe?


(c) As explained above, fading is caused by the addition of several reflections that reach the
receiver approximately at the same instant. However, in many cases, other reflections might
originate from a far away object such as a mountain or a tall building. These reflections arrive
at the receiver with longer delay than the first group of reflections. In such situations, a single-path
Rayleigh fading model is not adequate anymore to represent the wireless channel. To
model this so-called multipath phenomenon, a finite-impulse response model for the channel
can be used, say one of the form

$$ h(n) = \sum_{k=1}^{L}\gamma_k\,x_k(n)\,\delta(n - n_k) $$

where $\{\gamma_k\}$ and $\{x_k(n)\}$ are respectively the path loss and fading sequence of the $k$-th cluster
of reflectors, and the $\{n_k\}$ are the cluster delays. The sequences $\{x_k(n)\}$ are modeled as
independent Rayleigh fading sequences and the channel is referred to as a multipath Rayleigh
fading channel.

In this project, we consider a wireless channel with two Rayleigh fading rays; both rays are
assumed to fade at the same Doppler frequency of $f_D = 10$ Hz. The channel impulse response
sequence consists of two zero initial samples (i.e., an initial delay of two samples), followed
by a Rayleigh fading ray, followed by another zero sample, and by a second Rayleigh fading
ray. In other words, we are assuming a channel length of $M = 5$ taps with only two active
Rayleigh fading rays, so that the weight vector that we wish to estimate has the form:

$$ [\,0\;\;\;0\;\;\;x_1(n)\;\;\;0\;\;\;x_2(n)\,] $$

Train an LMS filter to estimate and track this multipath channel. Assume white input of
unit variance is transmitted across the channel and use it to excite the adaptive filter. Assume
further that the output of the channel is observed in the presence of white additive Gaussian
noise with variance $\sigma_v^2 = 0.001$. Use $\mu = 0.01$ and average the learning curve of LMS
over 100 experiments. Plot the learning curve and compute the resulting MSE. Plot also the
time evolution of the first ray and its estimate by the LMS filter over a particular experiment.
Use the function rayleigh.m to generate time sequences for both channel rays, say of duration
30000 samples, and assume the sampling period is $T_s = 0.8\ \mu$s.


(d) A first-order approximation for the variation of a Rayleigh fading coefficient $x(n)$ is to 
assume that $x(n)$ varies according to the auto-regressive model: 

$$x(n) = r(1)\, x(n-1) + \sqrt{1 - |r(1)|^2}\; \eta(n)$$

where $r(1) = J_0(2\pi f_D T_s)$ and $\eta(n)$ denotes a white noise process with unit variance. 
Now since the multipath rays of the channel in part (c) are assumed to fade at the same rate, 
the above approximation indicates that the variations in the channel weight vector could be 
approximated as $w_i^o = \alpha w_{i-1}^o + q_i$, where the covariance matrix of $q_i$ is $Q = (1 - \alpha^2) I$ 
with $\alpha = r(1)$. Of course, the value of $\alpha$ depends on the Doppler frequency. Use this model, 
and the second expression for the EMSE of LMS from Lemma 21.1, namely, 

$$\zeta^{\rm LMS} = \frac{\mu \sigma_v^2\, {\rm Tr}(R_u) + \mu^{-1}\, {\rm Tr}(Q)}{2 - \mu\, {\rm Tr}(R_u)}$$

to evaluate the theoretical MSE for part (c); recall that MSE $= \sigma_v^2 +$ EMSE. Compare your 
answer with the simulated result from part (c). 
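
The theoretical evaluation requested in this part can be sketched numerically as follows. This is only an illustrative calculation under stated assumptions: the input is white with unit variance so that $R_u = I$ and ${\rm Tr}(R_u) = M = 5$, the matrix $Q = (1-\alpha^2)I$ is taken exactly as written above, and the helper names `bessel_j0` and `lms_tracking_mse` are hypothetical (they are not part of rayleigh.m or of the text). The Bessel function is evaluated from its power series.

```python
import math

def bessel_j0(x, terms=20):
    """J_0(x) from its power series sum_k (-1)^k (x/2)^(2k) / (k!)^2."""
    total, term = 0.0, 1.0
    for k in range(terms):
        total += term
        term *= -((x / 2.0) ** 2) / ((k + 1) ** 2)
    return total

def lms_tracking_mse(mu, sigma_v2, f_doppler, t_s, M=5):
    """EMSE expression quoted from Lemma 21.1, assuming R_u = I (white unit-variance input)."""
    alpha = bessel_j0(2.0 * math.pi * f_doppler * t_s)   # alpha = r(1)
    tr_Ru = float(M)                                     # Tr(R_u) for R_u = I
    tr_Q = M * (1.0 - alpha ** 2)                        # Tr(Q) for Q = (1 - alpha^2) I
    emse = (mu * sigma_v2 * tr_Ru + tr_Q / mu) / (2.0 - mu * tr_Ru)
    return alpha, emse, sigma_v2 + emse                  # MSE = sigma_v^2 + EMSE

alpha, emse, mse = lms_tracking_mse(mu=0.01, sigma_v2=0.001,
                                    f_doppler=10.0, t_s=0.8e-6)
print("alpha =", alpha)
print("EMSE  =", emse)
print("MSE   =", mse)
```

The printed MSE is the value to compare against the simulated steady-state level of the averaged learning curve from part (c).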


(e) The expression for the EMSE that is used in part (d) was derived in Lemma 21.1 by using the 
nonstationary model (20.10), which assumes $\alpha = 1$. Since the model for the Rayleigh fading 
channel under consideration has 

$$\alpha = r(1) = J_0(2\pi \times 10 \times 0.8 \times 10^{-6}) \approx 0.99999999936835 \approx 1$$

we are justified to rely on Lemma 21.1. Now recall that in Prob. IV.33 we studied the tracking 
performance of LMS for a more general nonstationary model, which allows the inclusion of a 
non-unity $\alpha$. If we set $\Omega = 0$ in that problem, we find that the EMSE of LMS for a model of 
the form $w_i^o = \alpha w_{i-1}^o + q_i$ is given by 


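$$\zeta^{\rm LMS} = \frac{\mu \sigma_v^2\, {\rm Tr}(R_u) + \mu^{-1} \beta}{2 - \mu\, {\rm Tr}(R_u)}$$

where 

$$\beta = {\rm Tr}\left\{ \left[ I + 2(1-\alpha) X_\alpha \right] Q \right\}, \qquad X_\alpha = (I - \mu R_u) \left[ \alpha I - (I - \mu R_u) \right]^{-1}$$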


Compute the theoretical MSE of LMS using this alternative expression and compare it with 
the simulated and theoretical results of parts (c) and (d). 


(f) For the same scenario of parts (c) and (d), vary the Doppler frequency from 10 Hz to 25 Hz 
in increments of 5 Hz and generate a plot of the MSE as a function of the Doppler frequency. 
Run the LMS filter for 60000 iterations in each case and average the squared-error curve over 
100 experiments. Continue to use $\mu = 0.01$. What do you observe? 


(g) Fix the Doppler frequency at 10 Hz and run the LMS filter for 20000 iterations for the 
step-sizes 

$$\mu \in \{0.003,\ 0.005,\ 0.007,\ 0.01,\ 0.03,\ 0.05,\ 0.07\}$$

For each case, generate a learning curve by averaging over 50 experiments and compute the 
mean of the last 200 entries of the resulting curve. Use this value as indicative of the MSE 
of the algorithm. Generate a plot showing the MSE value versus the step-size. Compare with 
the theoretical MSE. 
Chapter 22: Weighted Energy Conservation 
Chapter 23: LMS with Gaussian Regressors 
Chapter 24: LMS with non-Gaussian Regressors 
Chapter 25: Data-Normalized Filters 


Summary and Notes 
Problems and Computer Projects 


CHAPTER 22 


Weighted Energy Conservation 


As is evident by now, adaptive filters are time-variant and nonlinear stochastic systems 
with inherent learning and tracking abilities. The success of their learning mechanism can 
be measured in terms of how well they learn the underlying signal statistics given sufficient 
time (i.e., in terms of their steady-state performance) and in terms of how fast and how sta- 
bly they adapt to changes in the signal statistics (i.e., in terms of their transient and conver- 
gence performance). For this reason, it is customary to study the performance of adaptive 
filters by examining their transient performance and their steady-state performance. The 
former is concerned with the stability and convergence rate of an adaptive scheme, while 
the latter is concerned with the mean-square error that remains in steady-state. 

In Part IV (Mean-Square Performance) we focused on the steady-state performance 
of adaptive filters in both stationary and nonstationary environments, as well as on the 
performance degradation that occurs in finite-precision implementations. In this part we 
turn our attention to the transient performance of adaptive filters. We continue to rely on 
the energy conservation arguments of Chapter 15 and show how these same arguments 
can be used to perform transient analysis in a uniform manner across different algorithms. 
Compared with the derivations in Chapter 15, and for reasons explained below, it will turn 
out that transient analysis is more conveniently performed by relying on a weighted energy- 
conservation relation, as opposed to the unweighted version that we have employed so far 
in the book. 

We focus in this part on the class of adaptive filters with data normalization for which 
a more detailed transient analysis is easier to advance. In App. 9.C of Sayed (2003), it is 
shown how to extend the arguments to the class of adaptive filters with error nonlinearities; 
the transient analysis for this class of filters is more demanding due to their use of nonlinear 
error functions. 


22.1 DATA MODEL 


We rely on the same data model that we adopted in Sec. 15.2 for stationary environments. 
Thus, let $d(i)$ denote the reference sequence and $u_i$ denote the regressor sequence. We 
then assume that the data $\{d(i), u_i\}$ satisfy the following conditions: 

(a) There exists a vector $w^o$ such that $d(i) = u_i w^o + v(i)$. 
(b) The noise sequence $\{v(i)\}$ is i.i.d. with variance $\sigma_v^2 = E\,|v(i)|^2$. 
(c) The noise sequence $\{v(i)\}$ is independent of $u_j$ for all $i, j$. 
(d) The initial condition $w_{-1}$ is independent of all $\{d(j), u_j, v(j)\}$. 
(e) The regressor covariance matrix is denoted by $R_u = E\, u_i^* u_i > 0$. 
(f) The random variables $\{v(i), u_i\}$ have zero means. 
(22.1) 

In the above list, $w_{-1}$ denotes the initial condition for an adaptive filter and it is assumed 
to be independent of all data. Moreover, in accordance with our convention, the $u_i$ are 
taken as row vectors. All other vectors are column vectors. The unknown weight vector 
$w^o$ and the adaptive filter weight vector $w_i$ both have dimensions $M \times 1$. We could 
have adopted a more general data model than (22.1), such as the nonstationary model of 
Sec. 20.2. However, it will become clear that the arguments developed in this chapter for 
the stationary case can be extended to nonstationary models in a straightforward manner 
(see, e.g., Probs. V.33 and V.34). 


22.2 DATA-NORMALIZED ADAPTIVE FILTERS 


We shall study filter updates of the form 

$$w_i = w_{i-1} + \mu\, \frac{u_i^*}{g[u_i]}\, e(i), \quad i \ge 0 \qquad (22.2)$$

where $w_i$ is an estimate for $w^o$ at iteration $i$, $\mu$ is the step-size, 

$$e(i) = d(i) - u_i w_{i-1} \qquad (22.3)$$

is the estimation error, and $g[\cdot]$ is some positive-valued function of $u_i$. For example, the 
choice $g[u_i] = 1$ results in the LMS algorithm, while $g[u_i] = \epsilon + \|u_i\|^2$ results in the 
$\epsilon$-NLMS algorithm. One could also study more general data-normalized updates of the 
form 

$$w_i = w_{i-1} + \mu\, H[u_i]\, u_i^*\, e(i), \quad i \ge 0 \qquad (22.4)$$

where $H[\cdot]$ is some Hermitian positive-definite matrix-valued (as opposed to scalar-valued) 
function of $u_i$. Examples of choices for $H[\cdot]$, leading to more general forms of adaptive 
filters, are treated in the problems at the end of this part (see Probs. V.22, V.23 and V.36). 
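
As a small illustration of the scalar-normalized update (22.2)-(22.3), the sketch below performs one iteration for real-valued data, so that $u_i^*$ reduces to a transpose. The function names `normalized_step`, `g_lms`, and `make_g_nlms` are hypothetical helpers introduced only for this example; the two choices of $g[\cdot]$ are the LMS and $\epsilon$-NLMS cases mentioned above.

```python
def normalized_step(w, u, d, mu, g):
    """One update w_i = w_{i-1} + mu * u^T * e(i) / g[u] for real-valued data."""
    e = d - sum(uk * wk for uk, wk in zip(u, w))      # e(i) = d(i) - u_i w_{i-1}
    scale = mu * e / g(u)
    return [wk + scale * uk for wk, uk in zip(w, u)], e

def g_lms(u):
    """g[u] = 1 gives LMS."""
    return 1.0

def make_g_nlms(eps):
    """g[u] = eps + ||u||^2 gives epsilon-NLMS."""
    return lambda u: eps + sum(uk * uk for uk in u)

# One iteration from a zero initial condition with a single data pair.
w_prev = [0.0, 0.0]
u_i, d_i, mu = [1.0, 2.0], 3.0, 0.5
w_lms, e_lms = normalized_step(w_prev, u_i, d_i, mu, g_lms)
w_nlms, e_nlms = normalized_step(w_prev, u_i, d_i, mu, make_g_nlms(1.0))
print(w_lms, w_nlms)
```

Passing $g[\cdot]$ as a function keeps the update itself unchanged across algorithms, which mirrors how the analysis in this chapter treats the whole class at once.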


22.3 WEIGHTED ENERGY CONSERVATION RELATION 


Since we shall deal with weighted vector norms on a regular basis in this chapter, we adopt 
the compact notation $\|x\|^2_\Sigma$ to refer to the weighted squared Euclidean norm of a vector $x$, 
i.e., 

$$\|x\|^2_\Sigma \triangleq x^* \Sigma x$$

for some Hermitian positive-definite weighting matrix $\Sigma$. The choice $\Sigma = I$ results in the 
standard squared Euclidean norm of $x$, i.e., 

$$\|x\|^2_I = x^* x = \|x\|^2$$

Although we deal in general with the case $\Sigma > 0$, we shall use the same notation $\|x\|^2_\Sigma$ to 
denote $x^* \Sigma x$ even when $\Sigma$ is non-negative definite (see Prob. V.1). 

The need to consider weighted norms in the context of a transient analysis of adaptive 
filters can be motivated as follows. It will be seen that the transient performance of an 
adaptive filter requires that we study the time evolution of expectations of the form $E\,\|\tilde w_i\|^2$ 
and $E\,|e_a(i)|^2$, where the first expectation relates to the weight-error vector, 

$$\tilde w_i \triangleq w^o - w_i$$

while the second expectation relates to the a priori estimation error, 

$$e_a(i) = u_i \tilde w_{i-1}$$

The evaluation of the expectation $E\,|e_a(i)|^2$ will in turn require that we evaluate a weighted 
norm of $\tilde w_{i-1}$ of the form 

$$E\,\|\tilde w_{i-1}\|^2_\Sigma$$

with the particular weighting matrix $\Sigma = R_u$. Now the energy conservation relation that 
we encountered in Thm. 15.1 involves the squared Euclidean norm of $\tilde w_{i-1}$, $\|\tilde w_{i-1}\|^2$, 
and not any weighted version of it. For this reason, we shall first extend our arguments 
to allow for weighted vector norms. As we shall see, a weighted version of the energy 
relation can be obtained rather immediately by following arguments similar to those that 
led to Thm. 15.1. 

Thus, let $\Sigma$ denote any $M \times M$ Hermitian nonnegative-definite matrix (in general, we 
shall have $\Sigma > 0$). Later we shall see that different choices for $\Sigma$ are useful to infer 
different conclusions about the performance of an adaptive filter. Now define the weighted 
a priori and a posteriori error signals 

$$e_a^\Sigma(i) \triangleq u_i \Sigma \tilde w_{i-1}, \qquad e_p^\Sigma(i) \triangleq u_i \Sigma \tilde w_i \qquad (22.5)\text{-}(22.6)$$

When $\Sigma = I$, we recover the standard errors 

$$e_a(i) \triangleq e_a^I(i) = u_i \tilde w_{i-1}, \qquad e_p(i) \triangleq e_p^I(i) = u_i \tilde w_i \qquad (22.7)$$


The energy relation that we seek is one that compares the weighted energies of the error 
quantities: 

$$\{\tilde w_i,\ \tilde w_{i-1},\ e_a^\Sigma(i),\ e_p^\Sigma(i)\} \qquad (22.8)$$


To arrive at the desired relation, we follow the same arguments that we employed before 
in Sec. 15.3. 

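First, we rewrite the update recursion (22.2)-(22.3) in terms of the weight-error vector 
$\tilde w_i$. Subtracting both sides of (22.2) from $w^o$ we get 

$$\tilde w_i = \tilde w_{i-1} - \mu\, \frac{u_i^*}{g[u_i]}\, e(i) \qquad (22.9)$$

If we further multiply both sides of (22.9) by $u_i \Sigma$ from the left we find that the a priori 
and a posteriori estimation errors $\{e_p^\Sigma(i), e_a^\Sigma(i)\}$ are related via 

$$e_p^\Sigma(i) = e_a^\Sigma(i) - \mu\, \frac{\|u_i\|^2_\Sigma}{g[u_i]}\, e(i) \qquad (22.10)$$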


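We distinguish between two cases: 

1. $\|u_i\|^2_\Sigma \neq 0$. In this case, we use (22.10) to solve for $e(i)/g[u_i]$, 

$$\frac{e(i)}{g[u_i]} = \frac{1}{\mu \|u_i\|^2_\Sigma} \left[ e_a^\Sigma(i) - e_p^\Sigma(i) \right]$$

and substitute into (22.9) to get 

$$\tilde w_i + \frac{u_i^*}{\|u_i\|^2_\Sigma}\, e_a^\Sigma(i) = \tilde w_{i-1} + \frac{u_i^*}{\|u_i\|^2_\Sigma}\, e_p^\Sigma(i) \qquad (22.11)$$

On each side of this identity we have a combination of a priori and a posteriori 
errors. By equating the weighted Euclidean norms of both sides of the equation, i.e., 
by setting 

$$\left\| \tilde w_i + \frac{u_i^*}{\|u_i\|^2_\Sigma}\, e_a^\Sigma(i) \right\|^2_\Sigma = \left\| \tilde w_{i-1} + \frac{u_i^*}{\|u_i\|^2_\Sigma}\, e_p^\Sigma(i) \right\|^2_\Sigma$$

we find, after a straightforward calculation, that the following energy equality holds: 

$$\|\tilde w_i\|^2_\Sigma + \frac{1}{\|u_i\|^2_\Sigma}\, |e_a^\Sigma(i)|^2 = \|\tilde w_{i-1}\|^2_\Sigma + \frac{1}{\|u_i\|^2_\Sigma}\, |e_p^\Sigma(i)|^2$$

Observe that this equality simply amounts to adding the weighted energies of the 
individual terms of (22.11); the cross-terms cancel out. Equivalently, we can rewrite 
the above equality as 

$$\|\tilde w_i\|^2_\Sigma + \bar\mu^\Sigma(i)\, |e_a^\Sigma(i)|^2 = \|\tilde w_{i-1}\|^2_\Sigma + \bar\mu^\Sigma(i)\, |e_p^\Sigma(i)|^2 \qquad (22.12)$$

where 

$$\bar\mu^\Sigma(i) \triangleq \begin{cases} 1/\|u_i\|^2_\Sigma & \text{if } \|u_i\|^2_\Sigma \neq 0 \\ 0 & \text{otherwise} \end{cases} \qquad (22.13)$$

2. $\|u_i\|^2_\Sigma = 0$ (since $\Sigma \ge 0$, this implies that $u_i \Sigma = 0$). In this case, it is obvious from 
(22.10) that $e_p^\Sigma(i) = e_a^\Sigma(i)$ and from (22.9) that $\|\tilde w_i\|^2_\Sigma = \|\tilde w_{i-1}\|^2_\Sigma$, so that (22.12) 
is valid again. 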

In summary, we arrive at the following statement. 


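Theorem 22.1 (Weighted energy-conservation relation) For any adaptive filter 
of the form (22.2), any Hermitian nonnegative-definite matrix $\Sigma$, and for 
any data $\{d(i), u_i\}$, it holds that 

$$\|\tilde w_i\|^2_\Sigma + \bar\mu^\Sigma(i)\, |e_a^\Sigma(i)|^2 = \|\tilde w_{i-1}\|^2_\Sigma + \bar\mu^\Sigma(i)\, |e_p^\Sigma(i)|^2$$

where $e_a^\Sigma(i) = u_i \Sigma \tilde w_{i-1}$, $e_p^\Sigma(i) = u_i \Sigma \tilde w_i$, $\tilde w_i = w^o - w_i$, and $\bar\mu^\Sigma(i)$ is 
defined by (22.13). 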
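
Since the relation of Thm. 22.1 is exact, it can be checked numerically on any sample path. The following sketch does so for real-valued data and an $\epsilon$-NLMS update; the dimensions, step-size, noise level, and the construction of a positive-definite $\Sigma$ are arbitrary choices made for illustration, and the helper names (`matvec`, `dot`, `wnorm2`) are hypothetical.

```python
import random

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def matvec(A, x):
    return [dot(row, x) for row in A]

def wnorm2(x, S):
    """Weighted squared norm ||x||^2_S = x^T S x (real data)."""
    return dot(x, matvec(S, x))

random.seed(0)
M, mu, eps = 4, 0.5, 1e-3

# Sigma = B^T B + I is symmetric positive-definite by construction.
B = [[random.gauss(0, 1) for _ in range(M)] for _ in range(M)]
Sigma = [[sum(B[k][r] * B[k][c] for k in range(M)) + (1.0 if r == c else 0.0)
          for c in range(M)] for r in range(M)]

w_o = [random.gauss(0, 1) for _ in range(M)]
w = [0.0] * M
max_gap = 0.0
for i in range(200):
    u = [random.gauss(0, 1) for _ in range(M)]
    d = dot(u, w_o) + 0.1 * random.gauss(0, 1)
    err_prev = [a - b for a, b in zip(w_o, w)]          # weight error at i-1
    e = d - dot(u, w)
    g = eps + dot(u, u)                                 # epsilon-NLMS normalization
    w = [wk + mu * uk * e / g for wk, uk in zip(w, u)]  # update (22.2)
    err = [a - b for a, b in zip(w_o, w)]               # weight error at i
    Su = matvec(Sigma, u)                               # equals (u Sigma)^T, Sigma symmetric
    ea = dot(Su, err_prev)                              # weighted a priori error
    ep = dot(Su, err)                                   # weighted a posteriori error
    mu_bar = 1.0 / dot(u, Su)                           # (22.13), nonzero case
    lhs = wnorm2(err, Sigma) + mu_bar * ea ** 2
    rhs = wnorm2(err_prev, Sigma) + mu_bar * ep ** 2
    max_gap = max(max_gap, abs(lhs - rhs))

print("largest |lhs - rhs| over the run:", max_gap)
```

The gap stays at the level of floating-point rounding regardless of the step-size or noise, which reflects the fact that no approximation enters (22.12).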


The important fact to emphasize here is that no approximations have been used to 
establish the energy relation (22.12); it is an exact relation that shows how the energies of 
the weight-error vectors at two successive time instants are related to the energies of the 
a priori and a posteriori estimation errors. The special choice $\Sigma = I$ reduces to the energy 
relation of Thm. 15.1. In addition, the same geometric, physical, and system-theoretic 
interpretations that were presented in App. 15.A for the case $\Sigma = I$ can be extended to the 
weighted case with little effort (see App. 9.G of Sayed (2003)). 

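We also rewrite below, for later reference, the weight-error recursion (22.9). From the 
modeling assumption (22.1) we have 

$$e(i) = d(i) - u_i w_{i-1} = u_i \tilde w_{i-1} + v(i)$$

so that substituting into (22.9) we get the equivalent form 

$$\tilde w_i = \left( I - \mu\, \frac{u_i^* u_i}{g[u_i]} \right) \tilde w_{i-1} - \mu\, \frac{u_i^*}{g[u_i]}\, v(i) \qquad (22.14)$$

Theorem 22.2 (Weight-error recursion) For any adaptive filter of the form 
(22.2), and for data $\{d(i), u_i\}$ satisfying (22.1), it holds that 

$$\tilde w_i = \left( I - \mu\, \frac{u_i^* u_i}{g[u_i]} \right) \tilde w_{i-1} - \mu\, \frac{u_i^*}{g[u_i]}\, v(i)$$

where $\tilde w_i = w^o - w_i$. 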


22.4 WEIGHTED VARIANCE RELATION 


Relation (22.12) has several useful ramifications in the study of adaptive filters, as was 
already discussed at some length in Part IV (Mean-Square Performance). In this part 
we focus on its significance to transient analysis. Thus, recall that in Sec. 15.2 we used 
the energy-conservation relation (15.32) to study the steady-state performance of adaptive 
filters. In that section, we invoked the steady-state condition 

$$E\,\|\tilde w_i\|^2 = E\,\|\tilde w_{i-1}\|^2 \quad \text{as } i \to \infty$$

in order to cancel the effect of the weight-error vector from both sides of the energy 
relation, and then used the resulting variance relation (15.40) to evaluate the filter EMSE, i.e., 
the value of $E\,|e_a(\infty)|^2$. 

In transient analysis, on the other hand, we will be interested in the time evolution of 
$E\,\|\tilde w_i\|^2_\Sigma$ for some choices of interest for $\Sigma$ (usually, $\Sigma = I$ or $\Sigma = R_u$). For this 
reason, in transient analysis, rather than eliminate the effect of the weight-error vector from 
(22.12), the contributions of the other error quantities, $\{e_a^\Sigma(i), e_p^\Sigma(i)\}$, will instead be 
expressed in terms of the weight-error vector itself. In so doing, the energy relation (22.12) 
will be transformed into a recursion that describes the evolution of $E\,\|\tilde w_i\|^2_\Sigma$; see, e.g., 
Eq. (22.21) further ahead. 


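Variance Relation 

Thus, returning to (22.12), and replacing $e_p^\Sigma(i)$ by its equivalent expression (22.10) in 
terms of $e_a^\Sigma(i)$ and $e(i)$ we get 

$$\|\tilde w_i\|^2_\Sigma + \bar\mu^\Sigma(i)\, |e_a^\Sigma(i)|^2 = \|\tilde w_{i-1}\|^2_\Sigma + \bar\mu^\Sigma(i) \left| e_a^\Sigma(i) - \mu\, \frac{\|u_i\|^2_\Sigma}{g[u_i]}\, e(i) \right|^2$$

or, equivalently, after expanding the rightmost term, 

$$\|\tilde w_i\|^2_\Sigma = \|\tilde w_{i-1}\|^2_\Sigma + \mu^2\, \frac{\|u_i\|^2_\Sigma}{g^2[u_i]}\, |e(i)|^2 - \frac{\mu}{g[u_i]}\, e_a^\Sigma(i)\, e^*(i) - \frac{\mu}{g[u_i]}\, e_a^{\Sigma *}(i)\, e(i) \qquad (22.15)$$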
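Alternatively, this result can be obtained by starting directly from the weight-error vector 
recursion (22.9) and equating the weighted Euclidean norms of both sides. 

Now using the data model (22.1) for $d(i)$, it is clear that $e(i)$ is related to $e_a(i)$ and $v(i)$ 
as follows: 

$$e(i) = d(i) - u_i w_{i-1} = (u_i w^o + v(i)) - u_i w_{i-1} = e_a(i) + v(i) \qquad (22.16)$$

so that substituting into (22.15) we can eliminate $e(i)$ and get 

$$\begin{aligned}
\|\tilde w_i\|^2_\Sigma \;=\;& \|\tilde w_{i-1}\|^2_\Sigma \\
&+ \mu^2\, \frac{\|u_i\|^2_\Sigma}{g^2[u_i]}\, |e_a(i)|^2 + \mu^2\, \frac{\|u_i\|^2_\Sigma}{g^2[u_i]}\, |v(i)|^2 \\
&+ \mu^2\, \frac{\|u_i\|^2_\Sigma}{g^2[u_i]}\, v(i)\, e_a^*(i) + \mu^2\, \frac{\|u_i\|^2_\Sigma}{g^2[u_i]}\, v^*(i)\, e_a(i) \\
&- \frac{\mu}{g[u_i]}\, e_a^\Sigma(i)\, e_a^*(i) - \frac{\mu}{g[u_i]}\, e_a^\Sigma(i)\, v^*(i) \\
&- \frac{\mu}{g[u_i]}\, e_a^{\Sigma *}(i)\, e_a(i) - \frac{\mu}{g[u_i]}\, e_a^{\Sigma *}(i)\, v(i)
\end{aligned} \qquad (22.17)$$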


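Most of the factors in this equality disappear under expectation, while other factors can be 
expressed in terms of $\tilde w_{i-1}$. To see this, observe first that 

$$e_a^\Sigma(i)\, e_a^*(i) = \tilde w_{i-1}^*\, u_i^* u_i \Sigma\, \tilde w_{i-1}$$

and 

$$e_a^{\Sigma *}(i)\, e_a(i) = \tilde w_{i-1}^*\, \Sigma u_i^* u_i\, \tilde w_{i-1}$$

so that the first terms on the fourth and fifth lines of (22.17) can be grouped together as a 
single weighted norm of $\tilde w_{i-1}$ as follows: 

$$\frac{\mu}{g[u_i]}\, e_a^\Sigma(i)\, e_a^*(i) + \frac{\mu}{g[u_i]}\, e_a^{\Sigma *}(i)\, e_a(i) = \frac{\mu}{g[u_i]}\, \|\tilde w_{i-1}\|^2_{\Sigma u_i^* u_i + u_i^* u_i \Sigma}$$

Likewise, 

$$|e_a(i)|^2 = \tilde w_{i-1}^*\, u_i^* u_i\, \tilde w_{i-1} = \|\tilde w_{i-1}\|^2_{u_i^* u_i}$$

so that (22.17) becomes 

$$\begin{aligned}
\|\tilde w_i\|^2_\Sigma \;=\;& \|\tilde w_{i-1}\|^2_\Sigma \\
&+ \mu^2\, \frac{\|u_i\|^2_\Sigma}{g^2[u_i]}\, \|\tilde w_{i-1}\|^2_{u_i^* u_i} + \mu^2\, \frac{\|u_i\|^2_\Sigma}{g^2[u_i]}\, |v(i)|^2 \\
&- \frac{\mu}{g[u_i]}\, \|\tilde w_{i-1}\|^2_{\Sigma u_i^* u_i + u_i^* u_i \Sigma} \\
&+ \mu^2\, \frac{\|u_i\|^2_\Sigma}{g^2[u_i]}\, v(i)\, e_a^*(i) + \mu^2\, \frac{\|u_i\|^2_\Sigma}{g^2[u_i]}\, v^*(i)\, e_a(i) \\
&- \frac{\mu}{g[u_i]}\, e_a^\Sigma(i)\, v^*(i) - \frac{\mu}{g[u_i]}\, e_a^{\Sigma *}(i)\, v(i)
\end{aligned} \qquad (22.18)$$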


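Taking expectations of both sides of (22.18) we find 

$$E\,\|\tilde w_i\|^2_\Sigma = E\,\|\tilde w_{i-1}\|^2_\Sigma + \mu^2 E \left( \frac{\|u_i\|^2_\Sigma}{g^2[u_i]}\, \|\tilde w_{i-1}\|^2_{u_i^* u_i} \right) - \mu E \left( \frac{1}{g[u_i]}\, \|\tilde w_{i-1}\|^2_{\Sigma u_i^* u_i + u_i^* u_i \Sigma} \right) + \mu^2 \sigma_v^2\, E \left( \frac{\|u_i\|^2_\Sigma}{g^2[u_i]} \right) \qquad (22.19)$$

where the expectations of the cross-terms involving $v(i)$ evaluate to zero due to the 
modeling assumption on $v(i)$ from (22.1), namely, that $v(i)$ is zero-mean, i.i.d., and independent 
of $u_j$ for all $j$. Alternatively, we can obtain (22.19) by starting from the weight-error 
recursion (22.9), equating the weighted norms of both sides, and taking expectations (see, 
e.g., the result of Prob. V.27 specialized to $\alpha = 0$). 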

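We can further include some of the multiplicative factors in (22.19) into the weighting 
matrices and write 

$$E\,\|\tilde w_i\|^2_\Sigma = E\,\|\tilde w_{i-1}\|^2_\Sigma + E \left( \|\tilde w_{i-1}\|^2_{\frac{\mu^2 \|u_i\|^2_\Sigma}{g^2[u_i]} u_i^* u_i} \right) - E \left( \|\tilde w_{i-1}\|^2_{\frac{\mu}{g[u_i]} (\Sigma u_i^* u_i + u_i^* u_i \Sigma)} \right) + \mu^2 \sigma_v^2\, E \left( \frac{\|u_i\|^2_\Sigma}{g^2[u_i]} \right)$$

This equality can be written more compactly as follows. Introduce the random weighting 
matrix 

$$\boldsymbol{\Sigma}' \triangleq \Sigma - \frac{\mu}{g[u_i]}\, \Sigma u_i^* u_i - \frac{\mu}{g[u_i]}\, u_i^* u_i \Sigma + \frac{\mu^2 \|u_i\|^2_\Sigma}{g^2[u_i]}\, u_i^* u_i \qquad (22.20)$$

In accordance with our convention, we are using a boldface symbol $\boldsymbol{\Sigma}'$ since $\boldsymbol{\Sigma}'$ is a random 
quantity (owing to its dependence on the regressor $u_i$). Then 

$$E\,\|\tilde w_i\|^2_\Sigma = E \left( \|\tilde w_{i-1}\|^2_{\boldsymbol{\Sigma}'} \right) + \mu^2 \sigma_v^2\, E \left( \frac{\|u_i\|^2_\Sigma}{g^2[u_i]} \right) \qquad (22.21)$$

In other words, starting from the energy conservation relation (22.12), expanding it, 
expressing whatever factors possible in terms of $\tilde w_{i-1}$, and then taking expectations to 
eliminate cross-terms involving the noise variable $v(i)$, we arrive at the variance relation 
(22.21). This relation shows how the weighted mean-square norm of $\tilde w_i$ propagates in 
time. In particular, observe the important fact that the weighting matrices at the time 
instants $i$ and $i - 1$ are distinct and related via (22.20). Moreover, it is shown in Prob. V.2 
that $\boldsymbol{\Sigma}' \ge 0$ when $\Sigma \ge 0$. 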
Theorem 22.3 (Weighted variance relation) For adaptive filters of the form 
(22.2), Hermitian nonnegative-definite matrices $\Sigma$, and for data $\{d(i), u_i\}$ 
satisfying model (22.1), it holds that 

$$E\,\|\tilde w_i\|^2_\Sigma = E \left( \|\tilde w_{i-1}\|^2_{\boldsymbol{\Sigma}'} \right) + \mu^2 \sigma_v^2\, E \left( \frac{\|u_i\|^2_\Sigma}{g^2[u_i]} \right)$$

$$\boldsymbol{\Sigma}' = \Sigma - \frac{\mu}{g[u_i]}\, \Sigma u_i^* u_i - \frac{\mu}{g[u_i]}\, u_i^* u_i \Sigma + \frac{\mu^2 \|u_i\|^2_\Sigma}{g^2[u_i]}\, u_i^* u_i$$


Independence Assumption 

The recursion of Thm. 22.3 provides a compact characterization of the time-evolution (i.e., 
dynamics) of the expectation $E\,\|\tilde w_i\|^2_\Sigma$. However, more is needed in order for this recursion 
to permit a tractable transient analysis. This is because the recursion for $E\,\|\tilde w_i\|^2_\Sigma$ is hard 
to propagate as it stands due to the presence of the expectation 

$$E \left( \|\tilde w_{i-1}\|^2_{\boldsymbol{\Sigma}'} \right) = E \left( \tilde w_{i-1}^*\, \boldsymbol{\Sigma}'\, \tilde w_{i-1} \right) \qquad (22.22)$$

This expectation is difficult to evaluate because of the dependence of $\boldsymbol{\Sigma}'$ on $u_i$, and the 
dependence of $\tilde w_{i-1}$ on prior regressors (so that $\tilde w_{i-1}$ and $\boldsymbol{\Sigma}'$ are themselves dependent). 
These dependencies are among the most challenging hurdles in the transient analysis of 
adaptive filters. One common way to overcome them is to resort to an independence 
assumption on the regression sequence $\{u_i\}$, namely to assume that 

The sequence $\{u_i\}$ is independent and identically distributed (22.23) 


Actually, it is customary in the literature to start the transient analysis of an adaptive filter 
with the collection of independence assumptions that were described in Sec. 16.4, which 
included, in addition to the independence condition (22.23), a Gaussian requirement on $u_i$ 
as well. Our arguments will not require the regressors to be Gaussian. 

Although invalid in general, especially for tapped-delay-line implementations, the in- 
dependence assumption (22.23) is widely used due to the simplifications it introduces into 
the arguments. Without (22.23), the study of the transient behavior of adaptive filters can 
become highly challenging, even for simple algorithms. However, there are results in the 
literature that show that performance results that are obtained using the independence as- 
sumption are reasonably close to actual filter performance when the step-size is sufficiently 
small. We discuss some of these results in App. 24.A. It is for this reason, and in order 
to simplify the presentation in the chapter, that we shall continue our discussions by using 
(22.23). Recall, in comparison, that in Part IV (Mean-Square Performance) we studied the 
steady-state and tracking performance of adaptive filters without resorting to the indepen- 
dence assumption. This observation explains why we have opted to order Chapters 15-22 
in their present order with Chapter 22 on transient analysis coming last. It is because tran- 
sient analysis is usually more demanding in terms of conditions and restrictions on the 
data. In contrast, as was already shown in Part IV (Mean-Square Performance), much can 
be said about the performance of adaptive filters under less restrictive assumptions. 

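Now note that condition (22.23) guarantees that 

$\tilde w_{i-1}$ is independent of both $\boldsymbol{\Sigma}'$ and $u_i$ (22.24) 

This is because $\tilde w_{i-1}$ is a function of past regressors and noise, $\{u_j, v(j),\ j < i\}$ (cf. the 
explanation after (16.12)), while $\boldsymbol{\Sigma}'$ is a function of $u_i$ alone. Using (22.24) we can then 
split the expectation $E(\|\tilde w_{i-1}\|^2_{\boldsymbol{\Sigma}'})$ into 

$$E \left( \|\tilde w_{i-1}\|^2_{\boldsymbol{\Sigma}'} \right) = E \left( \|\tilde w_{i-1}\|^2_{\Sigma'} \right) \qquad (22.25)$$

with the random weighting matrix $\boldsymbol{\Sigma}'$ replaced by its mean, which we denote by $\Sigma'$, i.e., 

$$\Sigma' \triangleq E\, \boldsymbol{\Sigma}'$$

Equality (22.25) follows from the identities 

$$\begin{aligned}
E \left( \|\tilde w_{i-1}\|^2_{\boldsymbol{\Sigma}'} \right) &= E \left( \tilde w_{i-1}^*\, \boldsymbol{\Sigma}'\, \tilde w_{i-1} \right) \\
&= E \left( E \left[ \tilde w_{i-1}^*\, \boldsymbol{\Sigma}'\, \tilde w_{i-1} \mid \tilde w_{i-1} \right] \right) \\
&= E \left( \tilde w_{i-1}^* \left[ E (\boldsymbol{\Sigma}' \mid \tilde w_{i-1}) \right] \tilde w_{i-1} \right) \\
&= E \left( \tilde w_{i-1}^*\, (E\, \boldsymbol{\Sigma}')\, \tilde w_{i-1} \right) \quad \text{because of (22.24)} \\
&= E \left( \|\tilde w_{i-1}\|^2_{\Sigma'} \right)
\end{aligned}$$

Thus, observe that the main value of the independence assumption (22.23) lies in 
guaranteeing that $\tilde w_{i-1}$ is independent of $\boldsymbol{\Sigma}'$, in which case it is possible to use (22.25) and 
thereby simplify the subsequent derivations. We can see from the expression for $\boldsymbol{\Sigma}'$ in 
(22.20) that this same conclusion will hold if we replace condition (22.23) by the 
assumption that $\tilde w_{i-1}$ is independent of $u_i^* u_i / g[u_i]$. 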

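In this way, recursion (22.21) is replaced by 

$$E\,\|\tilde w_i\|^2_\Sigma = E\,\|\tilde w_{i-1}\|^2_{\Sigma'} + \mu^2 \sigma_v^2\, E \left( \frac{\|u_i\|^2_\Sigma}{g^2[u_i]} \right) \qquad (22.26)$$

with two deterministic (not random) weighting matrices $\{\Sigma, \Sigma'\}$ and where, by evaluating 
the expectation of (22.20), 

$$\Sigma' = \Sigma - \mu\, \Sigma\, E \left( \frac{u_i^* u_i}{g[u_i]} \right) - \mu\, E \left( \frac{u_i^* u_i}{g[u_i]} \right) \Sigma + \mu^2\, E \left( \frac{\|u_i\|^2_\Sigma}{g^2[u_i]}\, u_i^* u_i \right) \qquad (22.27)$$

It is further argued in Prob. V.2 that $\Sigma' \ge 0$ when $\Sigma \ge 0$. 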


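Theorem 22.4 (Weighted variance relation with independence) For 
adaptive filters of the form (22.2), Hermitian nonnegative-definite matrices 
$\Sigma$, and for data $\{d(i), u_i\}$ satisfying model (22.1) and the independence 
assumption (22.23), it holds that: 

$$E\,\|\tilde w_i\|^2_\Sigma = E\,\|\tilde w_{i-1}\|^2_{\Sigma'} + \mu^2 \sigma_v^2\, E \left( \frac{\|u_i\|^2_\Sigma}{g^2[u_i]} \right)$$

$$\Sigma' = \Sigma - \mu\, \Sigma\, E \left( \frac{u_i^* u_i}{g[u_i]} \right) - \mu\, E \left( \frac{u_i^* u_i}{g[u_i]} \right) \Sigma + \mu^2\, E \left( \frac{\|u_i\|^2_\Sigma}{g^2[u_i]}\, u_i^* u_i \right)$$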

Observe that the expression for $\Sigma'$ is data dependent only; i.e., it depends on $u_i$ alone 
and does not depend on the weight-error vectors. In this way, the recursion for $E\,\|\tilde w_i\|^2_\Sigma$ is 
decoupled from the computation of $\Sigma'$ in the statement of the theorem. Moreover, the 
expressions for $E\,\|\tilde w_i\|^2_\Sigma$ and $\Sigma'$ show that studying the transient behavior of an adaptive filter 
requires evaluating the three multivariate moments: 

$$E \left( \frac{\|u_i\|^2_\Sigma}{g^2[u_i]} \right), \qquad E \left( \frac{u_i^* u_i}{g[u_i]} \right), \qquad E \left( \frac{\|u_i\|^2_\Sigma}{g^2[u_i]}\, u_i^* u_i \right) \qquad (22.28)$$

which are solely dependent on $u_i$. Note further that the last moment in the above list 
appears multiplied by $\mu^2$ in the expression for $\Sigma'$. What this means is that sometimes, 
when the step-size is sufficiently small, this last moment could be ignored for the sake of 
simplification; see the remark following Thm. 24.1 and also Probs. V.14 and V.38, where this 
observation is pursued in greater detail. 

Finally, taking expectations of both sides of (22.14), using (22.23) and the fact that $v(i)$ 
is independent of $u_i$, we obtain the following result for the evolution of the mean of the 
weight-error vector. 


Theorem 22.5 (Mean weight error recursion) For any adaptive filter of the 
form (22.2), and for any data $\{d(i), u_i\}$ satisfying (22.1) and (22.23), it 
holds that 

$$E\, \tilde w_i = \left[ I - \mu\, E \left( \frac{u_i^* u_i}{g[u_i]} \right) \right] E\, \tilde w_{i-1} \qquad (22.29)$$


Convenient Change of Coordinates 

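Evaluation of the moments (22.28), and the subsequent analysis, can at times be 
simplified if we introduce a convenient change of coordinates by appealing to the 
eigen-decomposition of $R_u = E\, u_i^* u_i$. So let 

$$R_u = U \Lambda U^* \qquad (22.30)$$

where $\Lambda$ is diagonal with the eigenvalues of $R_u$, $\Lambda = {\rm diag}\{\lambda_k\}$, and $U$ is unitary (i.e., it 
satisfies $U U^* = U^* U = I$). Then define the transformed quantities: 

$$\bar w_i \triangleq U^* \tilde w_i, \qquad \bar u_i \triangleq u_i U, \qquad \bar\Sigma \triangleq U^* \Sigma U \qquad (22.31)$$

Since $U$ is unitary, it is easy to see that 

$$\|\bar w_i\|^2_{\bar\Sigma} = \|\tilde w_i\|^2_\Sigma \quad \text{and} \quad \|\bar u_i\|^2_{\bar\Sigma} = \|u_i\|^2_\Sigma \qquad (22.32)$$

For example, 

$$\|\bar w_i\|^2_{\bar\Sigma} = \bar w_i^*\, \bar\Sigma\, \bar w_i = (\tilde w_i^* U)(U^* \Sigma U)(U^* \tilde w_i) = \tilde w_i^*\, \Sigma\, \tilde w_i = \|\tilde w_i\|^2_\Sigma$$

Likewise for $\|\bar u_i\|^2_{\bar\Sigma}$. In the special case $\Sigma = I$, we have $\bar\Sigma = I$, $\|\bar w_i\|^2 = \|\tilde w_i\|^2$, and 
$\|\bar u_i\|^2 = \|u_i\|^2$. 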

Now under the change of variables (22.31), the variance relation of Thm. 22.4 retains 
the same form, namely 

$$E\,\|\bar w_i\|^2_{\bar\Sigma} = E\,\|\bar w_{i-1}\|^2_{\bar\Sigma'} + \mu^2 \sigma_v^2\, E \left( \frac{\|\bar u_i\|^2_{\bar\Sigma}}{g^2[u_i]} \right) \qquad (22.33)$$

where 

$$\bar\Sigma' \triangleq U^* \Sigma' U \qquad (22.34)$$

The data nonlinearity $g[\cdot]$ is usually invariant under unitary transformations, i.e., $g[\bar u_i] = 
g[u_i]$. This property is obvious for LMS and $\epsilon$-NLMS where $g[u_i] = 1$ and $g[u_i] = 
\epsilon + \|u_i\|^2$, respectively. However, the invariance property of $g[\cdot]$ is not necessary for our 
development; if it does not hold, then we would simply continue to work with $g[u_i]$ instead 
of $g[\bar u_i]$. 


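Continuing, from the equation for $\Sigma'$ in Thm. 22.4 we find that 

$$\bar\Sigma' = \bar\Sigma - \mu\, \bar\Sigma\, E \left( \frac{\bar u_i^* \bar u_i}{g[\bar u_i]} \right) - \mu\, E \left( \frac{\bar u_i^* \bar u_i}{g[\bar u_i]} \right) \bar\Sigma + \mu^2\, E \left( \frac{\|\bar u_i\|^2_{\bar\Sigma}}{g^2[\bar u_i]}\, \bar u_i^* \bar u_i \right) \qquad (22.35)$$

In the discussion that follows, we shall use either the standard relations (22.26)-(22.27) 
or their transformed versions (22.33)-(22.35); the transformed versions are particularly 
useful when the regressors $u_i$ are Gaussian (as we shall see in Sec. 23.1). 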


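Theorem 22.6 (Transformed weighted-variance relation) For any adaptive filter 
of the form (22.2), any Hermitian nonnegative-definite matrix $\Sigma$, and for 
data $\{d(i), u_i\}$ satisfying model (22.1) and the independence assumption 
(22.23), it holds that: 

$$E\,\|\bar w_i\|^2_{\bar\Sigma} = E\,\|\bar w_{i-1}\|^2_{\bar\Sigma'} + \mu^2 \sigma_v^2\, E \left( \frac{\|\bar u_i\|^2_{\bar\Sigma}}{g^2[\bar u_i]} \right)$$

$$\bar\Sigma' = \bar\Sigma - \mu\, \bar\Sigma\, E \left( \frac{\bar u_i^* \bar u_i}{g[\bar u_i]} \right) - \mu\, E \left( \frac{\bar u_i^* \bar u_i}{g[\bar u_i]} \right) \bar\Sigma + \mu^2\, E \left( \frac{\|\bar u_i\|^2_{\bar\Sigma}}{g^2[\bar u_i]}\, \bar u_i^* \bar u_i \right)$$

where the transformed variables $\{\bar w_i, \bar u_i, \bar\Sigma, \bar\Sigma'\}$ are related to the original 
variables $\{\tilde w_i, u_i, \Sigma, \Sigma'\}$ via (22.31) and (22.34), so that $E\,\|\bar w_i\|^2_{\bar\Sigma} = E\,\|\tilde w_i\|^2_\Sigma$. 


Likewise, the transformed version of the mean-weight error recursion (22.29) is 

$$E\, \bar w_i = \left[ I - \mu\, E \left( \frac{\bar u_i^* \bar u_i}{g[\bar u_i]} \right) \right] E\, \bar w_{i-1} \qquad (22.36)$$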


The purpose of the chapters that follow is to show how the above variance and mean 
relations can be used to characterize the transient performance of data-normalized adaptive 
filters. In particular, it will be seen that the freedom in selecting the weighting matrix $\Sigma$ 
can be used to great advantage in deriving several performance measures. 
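LMS with Gaussian Regressors 


We use the mean and variance relations of the last chapter to study the transient 
performance of the LMS algorithm, 

$$w_i = w_{i-1} + \mu\, u_i^*\, e(i) \qquad (23.1)$$

for which the data normalization in (22.2) is given by 

$$g[u_i] = 1 \qquad (23.2)$$

In this case, relations (22.26)-(22.27) and (22.29) become 

$$\begin{aligned}
E\,\|\tilde w_i\|^2_\Sigma &= E\,\|\tilde w_{i-1}\|^2_{\Sigma'} + \mu^2 \sigma_v^2\, E\,\|u_i\|^2_\Sigma \\
\Sigma' &= \Sigma - \mu\, \Sigma\, E[u_i^* u_i] - \mu\, E[u_i^* u_i]\, \Sigma + \mu^2\, E \left[ \|u_i\|^2_\Sigma\, u_i^* u_i \right] \\
E\, \tilde w_i &= \left[ I - \mu\, E(u_i^* u_i) \right] E\, \tilde w_{i-1}
\end{aligned} \qquad (23.3)$$

We therefore need to evaluate the three moments: 

$$E\, u_i^* u_i, \qquad E\,\|u_i\|^2_\Sigma, \qquad \text{and} \qquad E\,\|u_i\|^2_\Sigma\, u_i^* u_i \qquad (23.4)$$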

The first two moments are obvious, and can be evaluated regardless of any assumed 
distribution for the regression data since 

$$E\, u_i^* u_i = R_u \quad \text{(by definition)} \qquad (23.5)$$

$$E\,\|u_i\|^2_\Sigma = E\, u_i \Sigma u_i^* = E\, {\rm Tr}(u_i^* u_i \Sigma) = {\rm Tr}(R_u \Sigma) \qquad (23.6)$$

The difficulty lies in evaluating the last moment in (23.4). To do so, we shall treat two 
cases. First we treat the case of Gaussian regressors for which the last moment can be 
evaluated explicitly. Afterwards, we treat the general case of non-Gaussian regressors. 


23.1 MEAN AND VARIANCE RELATIONS 


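We assume in this chapter that the regressors $\{u_i\}$ arise from a circular Gaussian 
distribution with covariance matrix $R_u$ (cf. Sec. A.5). We say circular because we are treating 
the general case of complex-valued regressors; otherwise, the circularity assumption is not 
needed. In the Gaussian case, as we shall explain ahead following (23.11), it is more 
convenient to work with the transformed versions (22.33)-(22.35) and (22.36) of the variance 
and mean relations, which for LMS are given by 

$$\begin{aligned}
E\,\|\bar w_i\|^2_{\bar\Sigma} &= E\,\|\bar w_{i-1}\|^2_{\bar\Sigma'} + \mu^2 \sigma_v^2\, E\,\|\bar u_i\|^2_{\bar\Sigma} \\
\bar\Sigma' &= \bar\Sigma - \mu\, \bar\Sigma\, E[\bar u_i^* \bar u_i] - \mu\, E[\bar u_i^* \bar u_i]\, \bar\Sigma + \mu^2\, E \left[ \|\bar u_i\|^2_{\bar\Sigma}\, \bar u_i^* \bar u_i \right] \\
E\, \bar w_i &= \left[ I - \mu\, E(\bar u_i^* \bar u_i) \right] E\, \bar w_{i-1}
\end{aligned} \qquad (23.7)$$

The moments that we need to evaluate in the transformed domain are 

$$E\, \bar u_i^* \bar u_i, \qquad E\,\|\bar u_i\|^2_{\bar\Sigma}, \qquad \text{and} \qquad E\,\|\bar u_i\|^2_{\bar\Sigma}\, \bar u_i^* \bar u_i \qquad (23.8)$$

where the first two are again immediate to compute since 

$$E\, \bar u_i^* \bar u_i = \Lambda \qquad (23.9)$$

and 

$$E\,\|\bar u_i\|^2_{\bar\Sigma} = E\, \bar u_i \bar\Sigma \bar u_i^* = E\, {\rm Tr}(\bar u_i^* \bar u_i \bar\Sigma) = {\rm Tr}(\Lambda \bar\Sigma) \qquad (23.10)$$

With regards to the last moment in (23.8), we use the fact that $\bar u_i$ is circular Gaussian with 
a diagonal covariance matrix, and invoke the result of Lemma A.3, to write 

$$\begin{aligned}
E\,\|\bar u_i\|^2_{\bar\Sigma}\, \bar u_i^* \bar u_i &= E\, (\bar u_i \bar\Sigma \bar u_i^*)\, \bar u_i^* \bar u_i \\
&= E\, \bar u_i^*\, (\bar u_i \bar\Sigma \bar u_i^*)\, \bar u_i \\
&= \Lambda\, {\rm Tr}(\bar\Sigma \Lambda) + \Lambda \bar\Sigma \Lambda
\end{aligned} \qquad (23.11)$$

Recall that the statement of Lemma A.3 requires the variable $z$ to have a diagonal 
covariance matrix, which explains why we introduced the transformed vector $\bar u_i$ and the 
transformed relations (23.7) in the Gaussian regressor case. Moreover, if the regressors 
were real-valued rather than complex-valued, then we would invoke Lemma A.2 and use 
instead 

$$E\,\|\bar u_i\|^2_{\bar\Sigma}\, \bar u_i^{\sf T} \bar u_i = \Lambda\, {\rm Tr}(\bar\Sigma \Lambda) + 2 \Lambda \bar\Sigma \Lambda$$

with an additional factor of 2 compared to (23.11). 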
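Using (23.9)-(23.11), recursions (23.7) become 

$$E\,\|\bar w_i\|^2_{\bar\Sigma} = E\,\|\bar w_{i-1}\|^2_{\bar\Sigma'} + \mu^2 \sigma_v^2\, {\rm Tr}(\Lambda \bar\Sigma) \qquad (23.12)$$

$$\bar\Sigma' = \bar\Sigma - \mu\, \bar\Sigma \Lambda - \mu\, \Lambda \bar\Sigma + \mu^2 \left[ \Lambda\, {\rm Tr}(\bar\Sigma \Lambda) + \Lambda \bar\Sigma \Lambda \right] \qquad (23.13)$$

$$E\, \bar w_i = (I - \mu \Lambda)\, E\, \bar w_{i-1} \qquad (23.14)$$

Observe the interesting fact that $\bar\Sigma'$ will be diagonal if $\bar\Sigma$ is. Now since we are free to 
choose $\Sigma$ and, therefore, $\bar\Sigma$, we can assume that $\bar\Sigma$ is diagonal. Under these conditions, it 
is possible to rewrite (23.13) in a more compact form in terms of the diagonal entries of 
$\{\bar\Sigma, \bar\Sigma'\}$. 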


Diagonal Notation 

To do so, we define the vectors 

$$\bar\sigma \triangleq {\rm diag}\{\bar\Sigma\} \quad \text{and} \quad \lambda \triangleq {\rm diag}\{\Lambda\} \qquad (23.15)$$

That is, $\{\bar\sigma, \lambda\}$ are $M \times 1$ vectors with the diagonal entries of the corresponding matrices; $\bar\sigma$ 
contains the diagonal entries of $\bar\Sigma$, while $\lambda$ contains the diagonal entries of $\Lambda$. Actually, in 
this book, we shall use the notation ${\rm diag}\{\cdot\}$ in two directions, both of which will be obvious 
from the context. Writing ${\rm diag}\{A\}$, for an arbitrary matrix $A$, extracts the diagonal entries 
of $A$ into a vector. This is the convention we used in (23.15) to define $\{\bar\sigma, \lambda\}$. On the other 
hand, writing ${\rm diag}\{a\}$ for a column vector $a$ results in a diagonal matrix whose entries are 
obtained from $a$. Therefore, we shall also write, whenever necessary, 

$$\bar\Sigma = {\rm diag}\{\bar\sigma\} \quad \text{and} \quad \Lambda = {\rm diag}\{\lambda\} \qquad (23.16)$$

in order to recover $\{\bar\Sigma, \Lambda\}$ from $\{\bar\sigma, \lambda\}$. 


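Linear Vector Relation 

Now in terms of the vectors $\{\bar\sigma, \lambda\}$, it is easy to see that the matrix relation (23.13) is 
equivalent to the vector relation 

$$\bar\sigma' = (I - 2\mu \Lambda + \mu^2 \Lambda^2)\, \bar\sigma + \mu^2 (\lambda^{\sf T} \bar\sigma)\, \lambda$$

which can in turn be written more compactly as 

$$\bar\sigma' = F \bar\sigma \qquad (23.17)$$

with an $M \times M$ coefficient matrix $F$ defined by 

$$F \triangleq (I - 2\mu \Lambda + \mu^2 \Lambda^2) + \mu^2 \lambda \lambda^{\sf T} \qquad (23.18)$$

Expression (23.17) shows that the relation between the diagonal elements of $\bar\Sigma'$ and $\bar\Sigma$ is 
actually linear. Moreover, since $\bar\Sigma' = {\rm diag}\{\bar\sigma'\}$, the linear relation (23.17) translates into 
the matrix relation $\bar\Sigma' = {\rm diag}\{F \bar\sigma\}$. 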
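
The equivalence between the matrix relation (23.13) and the vector relation (23.17)-(23.18) can be checked numerically for a diagonal weighting matrix. The sketch below assumes real eigenvalues stored in a list `lam` and weights stored in a list `sigma`; both lists, the step-size, and the helper names `build_F` and `sigma_prime_diag` are illustrative choices rather than quantities from the text.

```python
def build_F(lam, mu):
    """F = (I - 2*mu*Lambda + mu^2*Lambda^2) + mu^2 * lam * lam^T, as in (23.18)."""
    M = len(lam)
    F = []
    for r in range(M):
        row = []
        for c in range(M):
            diag_part = 1.0 - 2.0 * mu * lam[r] + (mu * lam[r]) ** 2 if r == c else 0.0
            row.append(diag_part + mu ** 2 * lam[r] * lam[c])
        F.append(row)
    return F

def sigma_prime_diag(lam, sigma, mu):
    """Diagonal of (23.13) for diagonal Sigma-bar, evaluated entry by entry."""
    trace_term = sum(s * l for s, l in zip(sigma, lam))       # Tr(Sigma-bar Lambda)
    return [s - 2.0 * mu * s * l + mu ** 2 * (l * trace_term + l * s * l)
            for s, l in zip(sigma, lam)]

lam = [0.5, 1.0, 2.0]       # eigenvalues of R_u (illustrative)
sigma = [1.0, 1.0, 1.0]     # diagonal of Sigma-bar, here Sigma-bar = I
mu = 0.1

F = build_F(lam, mu)
via_F = [sum(F[r][c] * sigma[c] for c in range(len(lam))) for r in range(len(lam))]
direct = sigma_prime_diag(lam, sigma, mu)
print(via_F)
print(direct)
```

Both lists agree up to rounding, so propagating the weighting matrix over time only requires repeated multiplication by $F$.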


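Variance Relation 

We can rewrite recursion (23.12) by using the vectors $\{\bar\sigma, \bar\sigma'\}$ instead of the matrices 
$\{\bar\Sigma, \bar\Sigma'\}$. Using (23.17) and the notation (23.16), recursion (23.12) is equivalent to 

$$E\,\|\bar w_i\|^2_{{\rm diag}\{\bar\sigma\}} = E\,\|\bar w_{i-1}\|^2_{{\rm diag}\{F \bar\sigma\}} + \mu^2 \sigma_v^2\, (\lambda^{\sf T} \bar\sigma)$$

where, for the last term, we used the fact that ${\rm Tr}(\Lambda \bar\Sigma) = \lambda^{\sf T} \bar\sigma$. For compactness of notation, 
we shall drop the ${\rm diag}\{\cdot\}$ notation from the subscripts and keep the vectors only, so that 
the above will be rewritten more compactly as 

$$E\,\|\bar w_i\|^2_{\bar\sigma} = E\,\|\bar w_{i-1}\|^2_{F \bar\sigma} + \mu^2 \sigma_v^2\, (\lambda^{\sf T} \bar\sigma) \qquad (23.19)$$

The vector weighting factors $\{\bar\sigma, F \bar\sigma\}$ in this expression should be understood as compact 
representations for the actual weighting matrices $\{{\rm diag}\{\bar\sigma\}, {\rm diag}\{F \bar\sigma\}\}$. In other words, if 
$\sigma$ is any column vector, the notation $\|x\|^2_\sigma$ is used to mean 

$$\|x\|^2_\sigma \triangleq \|x\|^2_{{\rm diag}\{\sigma\}} = x^* \Sigma x, \quad \text{where } \Sigma = {\rm diag}\{\sigma\}$$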

In summary, starting from (23.12)-(23.14), we argued that for Gaussian regressors the 
weighting matrix $\bar\Sigma'$ is diagonal if $\bar\Sigma$ is chosen as diagonal, so that (23.12)-(23.13) can be 
equivalently expressed more compactly as in (23.17)-(23.18) and (23.19), namely, 

$$E\,\|\bar w_i\|^2_{\bar\sigma} = E\,\|\bar w_{i-1}\|^2_{\bar\sigma'} + \mu^2 \sigma_v^2\, (\lambda^{\sf T} \bar\sigma) \qquad (23.20)$$

$$\bar\sigma' = F \bar\sigma \qquad (23.21)$$

$$F = (I - 2\mu \Lambda + \mu^2 \Lambda^2) + \mu^2 \lambda \lambda^{\sf T} \qquad (23.22)$$

$$E\, \bar w_i = [I - \mu \Lambda]\, E\, \bar w_{i-1} \qquad (23.23)$$

Stability and performance analyses are now possible to pursue by using these relations. 
Recall that in transient analysis we are interested in the time evolution of the 
expectations $\{E\, \tilde w_i, E\,\|\tilde w_i\|^2\}$ or, equivalently, $\{E\, \bar w_i, E\,\|\bar w_i\|^2\}$ since $\bar w_i$ and $\tilde w_i$ are related via a 
unitary matrix as in (22.31). We start with the mean behavior. 


23.2 MEAN BEHAVIOR 


The behavior of $E\,\overline{w}_i$ can be inferred from (23.23). Thus, note that since, by assumption,
the initial condition is zero, $w_{-1} = 0$, we get

$$\tilde{w}_{-1} = w^o - w_{-1} = w^o$$

or, equivalently,

$$\overline{w}_{-1} = U^*\tilde{w}_{-1} \triangleq \overline{w}^o$$

The vector $w^o$ is modeled as an unknown constant so that $E\,\tilde{w}_{-1} = w^o$. Therefore, iterating
(23.23) we find that

$$E\,\overline{w}_i = (I - \mu\Lambda)^{i+1}\,\overline{w}^o$$


We can now derive a condition on the step-size in order to guarantee convergence in the
mean, i.e., in order to ensure that $E\,\overline{w}_i \to 0$ as $i \to \infty$, which is equivalent to $E\,\tilde{w}_i \to 0$.
Indeed, since $I - \mu\Lambda$ is a diagonal matrix, the condition on $\mu$ is easily seen to be:

$$|1 - \mu\lambda_k| < 1 \quad \text{for } k = 1, 2, \ldots, M$$

where the $\{\lambda_k\}$ are the entries of $\Lambda$ (i.e., the eigenvalues of $R_u$). In other words, $\mu$ should
satisfy

$$\mu < 2/\lambda_{\max} \qquad (23.24)$$

where $\lambda_{\max}$ is the largest eigenvalue of $R_u$. For such step-sizes, it follows that

$$\lim_{i\to\infty} E\,w_i = w^o$$

and we say that the LMS filter is convergent in the mean and, hence, asymptotically unbiased.
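
The mean recursion (23.23) and the bound (23.24) can be checked numerically. The following is a minimal sketch, assuming illustrative eigenvalues for $R_u$ and a hypothetical initial vector (neither comes from the text); it iterates the diagonal recursion entrywise for one step-size inside the bound and one outside it.

```python
# Illustrative eigenvalues of R_u (assumed values, not from the text).
eigs = [0.5, 1.0, 2.0]

def mean_error_norm(mu, eigs, wbar0, n_iter):
    """Iterate E wbar_i = (I - mu*Lambda) E wbar_{i-1} entrywise, as in (23.23)."""
    w = list(wbar0)
    for _ in range(n_iter):
        w = [(1.0 - mu * lam) * wk for lam, wk in zip(eigs, w)]
    return sum(wk * wk for wk in w) ** 0.5

mu_max = 2.0 / max(eigs)      # mean-stability bound (23.24)
wbar0 = [1.0, 1.0, 1.0]       # hypothetical transformed initial error

norm_inside = mean_error_norm(0.9 * mu_max, eigs, wbar0, 200)   # decays
norm_outside = mean_error_norm(1.1 * mu_max, eigs, wbar0, 200)  # grows
print(norm_inside, norm_outside)
```

The mode associated with $\lambda_{\max}$ is the one that diverges first, which is why the bound depends on the largest eigenvalue only.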


23.3 MEAN-SQUARE BEHAVIOR 


The study of the mean-square behavior of the filter is more demanding and also more
interesting. We start by noting that the desired quantity $E\|\tilde{w}_i\|^2$, which is also equal to
$E\|\overline{w}_i\|^2$, can be obtained from the variance recursion (23.20) if $\Sigma$ is chosen as $\Sigma = I$ (or,
equivalently, $\overline{\Sigma} = I$ in view of (22.31)). This corresponds to choosing $\overline{\sigma}$ as the column
vector with unit entries, i.e.,

$$\overline{\sigma} = \mathrm{col}\{1, 1, \ldots, 1, 1\} \triangleq q \qquad (23.25)$$

which we denote by $q$. Then (23.20) gives

$$E\|\overline{w}_i\|^2_{q} = E\|\overline{w}_{i-1}\|^2_{\overline{F}q} + \mu^2\sigma_v^2(\lambda^Tq) \qquad (23.26)$$


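This recursion shows that in order to evaluate $E\|\overline{w}_i\|^2_{q}$ we need to know $E\|\overline{w}_{i-1}\|^2_{\overline{F}q}$ with
a weighting matrix whose diagonal entries are $\overline{F}q$. Now the quantity $E\|\overline{w}_i\|^2_{\overline{F}q}$ can be
inferred from the variance relation (23.20) by writing it for the choice $\overline{\sigma} = \overline{F}q$, namely,

$$E\|\overline{w}_i\|^2_{\overline{F}q} = E\|\overline{w}_{i-1}\|^2_{\overline{F}^2q} + \mu^2\sigma_v^2(\lambda^T\overline{F}q) \qquad (23.27)$$

We now find from (23.27) that in order to evaluate the term $E\|\overline{w}_i\|^2_{\overline{F}q}$ we need to know
$E\|\overline{w}_{i-1}\|^2_{\overline{F}^2q}$ with a weighting matrix defined by the vector $\overline{F}^2q$. This term can again be
inferred from (23.20) by writing it for the choice $\overline{\sigma} = \overline{F}^2q$:

$$E\|\overline{w}_i\|^2_{\overline{F}^2q} = E\|\overline{w}_{i-1}\|^2_{\overline{F}^3q} + \mu^2\sigma_v^2(\lambda^T\overline{F}^2q) \qquad (23.28)$$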


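and a new term with weighting matrix determined by $\overline{F}^3q$ appears. The natural question
is whether this procedure terminates, or whether weighting matrices that correspond to
increasing powers of $\overline{F}$ keep coming up. The procedure does terminate. This is because
once we write (23.20) for the choice $\overline{\sigma} = \overline{F}^{M-1}q$ we get:

$$E\|\overline{w}_i\|^2_{\overline{F}^{M-1}q} = E\|\overline{w}_{i-1}\|^2_{\overline{F}^{M}q} + \mu^2\sigma_v^2(\lambda^T\overline{F}^{M-1}q) \qquad (23.29)$$

where the weighting matrix on the right-hand side is now determined by $\overline{F}^Mq$. However, we do not need
to write (23.20) for the choice $\overline{\sigma} = \overline{F}^Mq$ in order to determine $E\|\overline{w}_i\|^2_{\overline{F}^Mq}$. This is because
this last term can be deduced from the already available weighted factors:

$$\left\{E\|\overline{w}_i\|^2_{q},\; E\|\overline{w}_i\|^2_{\overline{F}q},\; E\|\overline{w}_i\|^2_{\overline{F}^2q},\; \ldots,\; E\|\overline{w}_i\|^2_{\overline{F}^{M-1}q}\right\} \qquad (23.30)$$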


This fact can be seen as follows. With any matrix $\overline{F}$ we associate its characteristic polynomial,
defined by

$$p(x) \triangleq \det(xI - \overline{F})$$

It is an $M$-th order polynomial in $x$,

$$p(x) = x^M + p_{M-1}x^{M-1} + p_{M-2}x^{M-2} + \ldots + p_1x + p_0$$

with coefficients $\{p_k, p_M = 1\}$ and whose roots coincide with the eigenvalues of $\overline{F}$. Now
a famous result in matrix theory, known as the Cayley-Hamilton theorem, states that every
matrix satisfies its characteristic equation, i.e., $p(\overline{F}) = 0$. In other words, $\overline{F}$ satisfies

$$\overline{F}^M + p_{M-1}\overline{F}^{M-1} + p_{M-2}\overline{F}^{M-2} + \ldots + p_1\overline{F} + p_0I_M = 0$$


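which means that the $M$-th power of $\overline{F}$ can be expressed as a linear combination of its
lower order powers. Using this fact we can write

$$
\begin{aligned}
E\|\overline{w}_i\|^2_{\overline{F}^Mq} &= E\|\overline{w}_i\|^2_{-p_0q - p_1\overline{F}q - \ldots - p_{M-1}\overline{F}^{M-1}q}\\
&= -p_0E\|\overline{w}_i\|^2_{q} - p_1E\|\overline{w}_i\|^2_{\overline{F}q} - \ldots - p_{M-1}E\|\overline{w}_i\|^2_{\overline{F}^{M-1}q}
\end{aligned}
$$

That is,

$$E\|\overline{w}_i\|^2_{\overline{F}^Mq} = \sum_{k=0}^{M-1} -p_k\,E\|\overline{w}_i\|^2_{\overline{F}^kq} \qquad (23.31)$$

which expresses $E\|\overline{w}_i\|^2_{\overline{F}^Mq}$ as a linear combination of the terms in (23.30), as desired.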


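We can collect the above results into a single self-contained recursion by writing (23.26)-(23.31)
as:

$$
\underbrace{\begin{bmatrix} E\|\overline{w}_i\|^2_{q}\\ E\|\overline{w}_i\|^2_{\overline{F}q}\\ E\|\overline{w}_i\|^2_{\overline{F}^2q}\\ \vdots\\ E\|\overline{w}_i\|^2_{\overline{F}^{M-2}q}\\ E\|\overline{w}_i\|^2_{\overline{F}^{M-1}q}\end{bmatrix}}_{\mathcal{W}_i}
=
\underbrace{\begin{bmatrix} 0 & 1 & & & \\ 0 & 0 & 1 & & \\ 0 & 0 & 0 & 1 & \\ \vdots & & & & \ddots\\ 0 & 0 & 0 & \cdots & 1\\ -p_0 & -p_1 & -p_2 & \cdots & -p_{M-1}\end{bmatrix}}_{\mathcal{F}}
\underbrace{\begin{bmatrix} E\|\overline{w}_{i-1}\|^2_{q}\\ E\|\overline{w}_{i-1}\|^2_{\overline{F}q}\\ E\|\overline{w}_{i-1}\|^2_{\overline{F}^2q}\\ \vdots\\ E\|\overline{w}_{i-1}\|^2_{\overline{F}^{M-2}q}\\ E\|\overline{w}_{i-1}\|^2_{\overline{F}^{M-1}q}\end{bmatrix}}_{\mathcal{W}_{i-1}}
+ \mu^2\sigma_v^2
\underbrace{\begin{bmatrix} \lambda^Tq\\ \lambda^T\overline{F}q\\ \lambda^T\overline{F}^2q\\ \vdots\\ \lambda^T\overline{F}^{M-2}q\\ \lambda^T\overline{F}^{M-1}q\end{bmatrix}}_{\mathcal{Y}}
$$

If we introduce the vector and matrix quantities $\{\mathcal{W}_i, \mathcal{F}, \mathcal{Y}\}$ indicated above, then this
recursion can be rewritten more compactly as

$$\mathcal{W}_i = \mathcal{F}\,\mathcal{W}_{i-1} + \mu^2\sigma_v^2\,\mathcal{Y} \qquad (23.32)$$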


This is a first-order recursion with a constant coefficient matrix $\mathcal{F}$. In the language of linear
system theory, a recursion of the form (23.32) is called a state-space recursion with
the vector $\mathcal{W}_i$ denoting the state vector. We therefore find that the mean-square behavior
of LMS is described by the $M$-dimensional state-space recursion (23.32) with coefficient
matrix $\mathcal{F}$. To be more precise, the transient behavior of LMS is described by the combination
of both (23.32) and recursion (23.23) for the mean weight-error vector. These two
recursions can be grouped together into a single $2M$-dimensional state-space model with
a block diagonal coefficient matrix as follows

$$
\begin{bmatrix} E\,\overline{w}_i\\ \mathcal{W}_i\end{bmatrix}
=
\begin{bmatrix} I - \mu\Lambda & 0\\ 0 & \mathcal{F}\end{bmatrix}
\begin{bmatrix} E\,\overline{w}_{i-1}\\ \mathcal{W}_{i-1}\end{bmatrix}
+ \mu^2\sigma_v^2\begin{bmatrix} 0\\ \mathcal{Y}\end{bmatrix}
$$


However, since the mean behavior is simpler to study, as we saw before, while the mean- 
square behavior is considerably richer, and since the combined state-space model is de- 
coupled due to the diagonal block structure of its coefficient matrix, we shall often, for 
simplicity, refer to the state-space model (23.32) as the model that ultimately determines 
the transient behavior of an adaptive filter. 

The matrix $\mathcal{F}$ in (23.32) has a special structure in this case; it is a companion matrix
(namely, a matrix with ones on the upper diagonal and the negatives of the coefficients of
$p(x)$ in the last row). A well-known property of such matrices is that their eigenvalues
coincide with the roots of $p(x) = 0$, which in turn are the eigenvalues of $\overline{F}$ from (23.22),
i.e.,

$$\{\text{eigenvalues of } \mathcal{F}\} = \{\text{roots of } p(x)\} = \{\text{eigenvalues of } \overline{F}\}$$

Recursion (23.32) shows that the transient behavior of LMS is the combined result of the
time evolutions of the $M$ variables in (23.30), which are the entries of $\mathcal{W}_i$. For this reason,
these variables are called state variables, since they determine the state of the filter at any
particular time instant.
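
The construction of the companion matrix and the state recursion (23.32) can be illustrated numerically. The sketch below assumes illustrative values for the eigenvalues of $R_u$, the step-size, the noise variance, and the squared entries of $\overline{w}^o$ (all hypothetical). It builds $\overline{F}$ from (23.22), forms the companion matrix from the characteristic polynomial, and compares the first state entry against a direct unrolling of the variance relation (23.20).

```python
import numpy as np

mu, sigma_v2 = 0.1, 0.01
lam = np.array([0.5, 1.0, 2.0])    # illustrative eigenvalues of R_u
z = np.array([1.0, 0.5, 0.25])     # hypothetical |entries of wbar^o|^2
M = lam.size
q = np.ones(M)

# Fbar from (23.22), circular Gaussian regressors.
Fbar = np.diag((1.0 - mu * lam) ** 2) + mu**2 * np.outer(lam, lam)

# np.poly returns [1, p_{M-1}, ..., p_0] for the characteristic polynomial.
p = np.poly(Fbar)[1:][::-1]        # reorder to [p_0, ..., p_{M-1}]

# Companion matrix: ones on the upper diagonal, -p_k in the last row.
Fcal = np.diag(np.ones(M - 1), k=1)
Fcal[-1, :] = -p

eig_companion = np.sort(np.linalg.eigvals(Fcal).real)
eig_Fbar = np.sort(np.linalg.eigvalsh(Fbar))

# State recursion (23.32), started from the deterministic state at time -1.
Fq = [np.linalg.matrix_power(Fbar, k) @ q for k in range(M)]
W = np.array([z @ v for v in Fq])
Y = np.array([lam @ v for v in Fq])
n = 50
for _ in range(n):
    W = Fcal @ W + mu**2 * sigma_v2 * Y

# Direct unrolling of the variance relation (23.20) up to time n-1.
direct = z @ np.linalg.matrix_power(Fbar, n) @ q + mu**2 * sigma_v2 * sum(
    lam @ np.linalg.matrix_power(Fbar, j) @ q for j in range(n))
print(W[0], direct)
```

The first entry of the state vector tracks $E\|\overline{w}_i\|^2$, so the two printed numbers should agree up to rounding.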


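Lemma 23.1 (Mean-square behavior of complex LMS) Consider the
LMS recursion (23.1) and assume the data $\{d(i), u_i\}$ satisfy model (22.1)
and the independence assumption (22.23). Assume further that the regressor
sequence is circular Gaussian. Then the mean and mean-square behaviors of
the filter are characterized by (23.23) and (23.32), namely, by the recursion

$$
\begin{bmatrix} E\,\overline{w}_i\\ \mathcal{W}_i\end{bmatrix}
=
\begin{bmatrix} I - \mu\Lambda & 0\\ 0 & \mathcal{F}\end{bmatrix}
\begin{bmatrix} E\,\overline{w}_{i-1}\\ \mathcal{W}_{i-1}\end{bmatrix}
+ \mu^2\sigma_v^2\begin{bmatrix} 0\\ \mathcal{Y}\end{bmatrix}
$$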


23.4 MEAN-SQUARE STABILITY 


The LMS filter will be said to be mean-square stable if, and only if, the state vector $\mathcal{W}_i$
remains bounded and tends to a steady-state value, regardless of the initial condition $\mathcal{W}_{-1}$.
A necessary and sufficient condition for this to hold can be found as follows. For any
state-space model of the form

$$x_i = Cx_{i-1} + b$$

a well-known result from linear system theory states that the sequence $\{x_i\}$ remains
bounded, regardless of the initial state vector $x_{-1}$, and that the sequence $\{x_i\}$ will tend
to a finite steady-state value as well, if, and only if, all the eigenvalues of $C$ lie inside the
open unit disc. Applying this result to (23.32), we conclude that the LMS filter will be
mean-square stable if, and only if, all eigenvalues of $\mathcal{F}$ are inside the unit disc or, equivalently,
the eigenvalues of $\overline{F}$ from (23.22) should satisfy

$$-1 < \lambda(\overline{F}) < 1 \qquad (23.33)$$

That is, we need $\overline{F}$ to be a stable matrix; here we are writing $\lambda(\overline{F})$ to refer to the eigenvalues
(spectrum) of $\overline{F}$.

The lower bound on $\lambda(\overline{F})$ is automatically satisfied because $\overline{F}$ is at least nonnegative-definite
(and, hence, its eigenvalues are all nonnegative). This fact can be seen by writing
$\overline{F}$ from (23.22) as the sum of two nonnegative-definite matrices:

$$\overline{F} = (I - 2\mu\Lambda + \mu^2\Lambda^2) + \mu^2\lambda\lambda^T = (I - \mu\Lambda)^2 + \mu^2\lambda\lambda^T \qquad (23.34)$$

To find a condition on $\mu$ for the upper bound on the eigenvalues of $\overline{F}$ to be satisfied, we
start by expressing $\overline{F}$ as

$$\overline{F} = I - \mu A + \mu^2B \qquad (23.35)$$

where, in this case, the matrices $A$ and $B$ are both positive-definite and given by

$$A \triangleq 2\Lambda, \qquad B \triangleq \Lambda^2 + \lambda\lambda^T \qquad (23.36)$$

It is shown in Prob. V.3 that for nonnegative-definite matrices $\overline{F}$ of the form (23.35), its
eigenvalues will be upper bounded by one if, and only if, the parameter $\mu$ satisfies

$$0 < \mu < 1/\lambda_{\max}(A^{-1}B) \qquad (23.37)$$

in terms of the maximum eigenvalue of $A^{-1}B$. It is also shown in that problem that the
eigenvalues of $A^{-1}B$ are real and positive.
Let

$$\eta^o \triangleq 1/\lambda_{\max}(A^{-1}B)$$

The bound (23.37) on $\mu$ is simply the smallest positive scalar $\eta$ that makes the matrix
$I - \eta A^{-1}B$ singular, i.e., it is the smallest $\eta$ such that

$$\det(I - \eta A^{-1}B) = 0 \qquad (23.38)$$


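Combining (23.37) with (23.24) we find that the condition on $\mu$ for the filter to converge
in both the mean and mean-square senses is

$$0 < \mu < \min\{2/\lambda_{\max},\; \eta^o\} \qquad (23.39)$$

It turns out that we can be more explicit and show that $\eta^o < 2/\lambda_{\max}$ so that the upper
bound on $\mu$ is ultimately determined by $\eta^o$ alone. Indeed, using the definitions (23.36) for
$A$ and $B$ we have

$$
\begin{aligned}
\det(I - \eta A^{-1}B) &= \det\left(I - \eta\,\frac{\Lambda^{-1}}{2}\left[\Lambda^2 + \lambda\lambda^T\right]\right)\\
&= \det\left(I - \frac{\eta}{2}\left[\Lambda + q\lambda^T\right]\right)
\end{aligned}
$$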


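We are interested in the smallest value of $\eta$ that makes $I - \eta A^{-1}B$ singular. Actually, in
view of condition (23.39) on $\mu$, we are only interested in those values of $\eta$ that lie within
the open interval $(0, 2/\lambda_{\max})$. If any such $\eta$ can be found, then $\eta^o < 2/\lambda_{\max}$. Since over
the interval $\eta \in (0, 2/\lambda_{\max})$ the matrix $(I - \frac{\eta}{2}\Lambda)$ is invertible, we can write

$$
\begin{aligned}
\det(I - \eta A^{-1}B) &= \det\left(I - \tfrac{\eta}{2}\Lambda\right)\cdot\det\left(I - \left[I - \tfrac{\eta}{2}\Lambda\right]^{-1}\tfrac{\eta}{2}\,q\lambda^T\right)\\
&= \left(1 - \tfrac{\eta}{2}\,\lambda^T\left[I - \tfrac{\eta}{2}\Lambda\right]^{-1}q\right)\cdot\det\left(I - \tfrac{\eta}{2}\Lambda\right) && (23.40)
\end{aligned}
$$

where, in the second equality, we used the determinant identity $\det(I - XY) = \det(I - YX)$
for any matrices $\{X, Y\}$ of compatible dimensions, so that when $X$ is a column and $Y$ is
a row we have

$$\det(I - xy^T) = 1 - y^Tx$$

The values of $\eta \in (0, 2/\lambda_{\max})$ that result in $\det(I - \eta A^{-1}B) = 0$ should therefore satisfy

$$\frac{\eta}{2}\,\lambda^T\left(I - \frac{\eta}{2}\Lambda\right)^{-1}q = 1$$

i.e.,

$$\sum_{k=1}^{M}\frac{\lambda_k\eta}{2 - \lambda_k\eta} = 1 \qquad (23.41)$$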

hei” ^k 

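Introduce the function

$$f(\eta) \triangleq \sum_{k=1}^{M}\frac{\lambda_k\eta}{2 - \lambda_k\eta} \qquad (23.42)$$

and observe that $f(0) = 0$ and $f(\eta)$ is monotonically increasing between $0$ and $2/\lambda_{\max}$;
this latter claim follows by noting that the derivative of $f(\cdot)$ with respect to $\eta$ is positive,

$$\frac{df}{d\eta} = \sum_{k=1}^{M}\frac{2\lambda_k}{(2 - \lambda_k\eta)^2} > 0 \quad \text{for } \eta \in [0, 2/\lambda_{\max})$$

Observe further that $f(\eta)$ has a singularity (i.e., a pole) at $\eta = 2/\lambda_{\max}$, so that $f(\eta) \to \infty$
as $\eta \to 2/\lambda_{\max}$; see Fig. 23.1.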

Therefore, we conclude that there exists a unique positive $\eta$ within the interval $(0, 2/\lambda_{\max})$
where the function $f(\eta)$ crosses one, i.e., for which

$$f(\eta) = 1$$

FIGURE 23.1 Behavior of the function $f(\eta)$ defined by (23.42) over the semi-open interval
$[0, 2/\lambda_{\max})$.

This value of $\eta$ is the smallest $\eta$ that makes $I - \eta A^{-1}B$ singular and, therefore, it coincides
with the desired $\eta^o$ and is smaller than $2/\lambda_{\max}$. In conclusion, we find that condition
(23.39) is equivalent to requiring $\mu$ to satisfy

$$\sum_{k=1}^{M}\frac{\lambda_k\mu}{2 - \lambda_k\mu} < 1$$

When this happens, the LMS filter will be mean-square stable. In summary, we arrive at
the following conclusion.


Theorem 23.1 (Stability of complex LMS) Consider the LMS recursion (23.1)
and assume the data $\{d(i), u_i\}$ satisfy model (22.1) and the independence
assumption (22.23). Assume further that the regressor sequence is circular
Gaussian. Then the LMS filter is mean-square stable (i.e., the state vector
$\mathcal{W}_i$ remains bounded and tends to a finite steady-state value) if, and only if,
the positive step-size $\mu$ satisfies

$$\sum_{k=1}^{M}\frac{\lambda_k\mu}{2 - \lambda_k\mu} < 1$$

where the $\{\lambda_k\}$ are the eigenvalues of $R_u$. The above condition on $\mu$ also
guarantees convergence in the mean, i.e., $E\,w_i \to w^o$.
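
Since $f(\eta)$ in (23.42) is increasing on $[0, 2/\lambda_{\max})$, the largest admissible step-size can be located by bisection. The following sketch, with assumed eigenvalues for $R_u$ and a hypothetical helper name `step_size_bound`, solves $f(\eta) = 1$ and cross-checks the root against the matrix form $1/\lambda_{\max}(A^{-1}B)$ from (23.37).

```python
import numpy as np

lam = np.array([0.5, 1.0, 2.0])   # illustrative eigenvalues of R_u

def f(eta, lam):
    """f(eta) from (23.42)."""
    return np.sum(lam * eta / (2.0 - lam * eta))

def step_size_bound(lam, tol=1e-12):
    """Bisection for the unique root of f(eta) = 1 on (0, 2/lambda_max)."""
    lo, hi = 0.0, 2.0 / lam.max()
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid, lam) < 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

eta_o = step_size_bound(lam)

# Cross-check with (23.37), using A = 2*Lambda and B = Lambda^2 + lam lam^T.
A = 2.0 * np.diag(lam)
B = np.diag(lam**2) + np.outer(lam, lam)
eta_matrix = 1.0 / np.max(np.linalg.eigvals(np.linalg.solve(A, B)).real)
print(eta_o, eta_matrix, 2.0 / lam.max())
```

The printed root should be strictly smaller than $2/\lambda_{\max}$, in line with the argument that the mean-square bound is the binding one.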


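The arguments leading to the theorem were carried out for complex-valued regressors. If
the regressors were real-valued, then, as mentioned after (23.11), the fourth-moment expression would
include an additional factor of 2, namely

$$E\,\|\overline{u}_i\|^2_{\overline{\Sigma}}\,\overline{u}_i^T\overline{u}_i = \Lambda\,\mathrm{Tr}(\overline{\Sigma}\Lambda) + 2\Lambda\overline{\Sigma}\Lambda$$

If we propagate this factor of 2, then only slight modifications will occur in the derivation.
In particular, expressions (23.20)-(23.23) would become

$$
\begin{aligned}
E\|\overline{w}_i\|^2_{\overline{\sigma}} &= E\|\overline{w}_{i-1}\|^2_{\overline{\sigma}'} + \mu^2\sigma_v^2(\lambda^T\overline{\sigma}) && (23.43)\\
\overline{\sigma}' &= \overline{F}\,\overline{\sigma} && (23.44)\\
\overline{F} &= (I - 2\mu\Lambda + 2\mu^2\Lambda^2) + \mu^2\lambda\lambda^T && (23.45)\\
E\,\overline{w}_i &= (I - \mu\Lambda)\,E\,\overline{w}_{i-1} && (23.46)
\end{aligned}
$$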


Moreover, the expression for the matrix $B$ in (23.36) would become $B = 2\Lambda^2 + \lambda\lambda^T$, and
condition (23.41) on $\eta$ would be replaced by

$$\frac{1}{2}\sum_{k=1}^{M}\frac{\lambda_k\eta}{1 - \lambda_k\eta} = 1 \qquad (23.47)$$

Theorem 23.2 (Mean-square behavior of real LMS) Consider the LMS recursion
(23.1) and assume the data $\{d(i), u_i\}$ satisfy model (22.1) and the independence
assumption (22.23). Assume further that the regressor sequence
is real-valued Gaussian. Then the LMS filter is mean-square stable (i.e., the
state vector $\mathcal{W}_i$ remains bounded and tends to a finite steady-state value) if,
and only if, the step-size $\mu$ satisfies

$$\frac{1}{2}\sum_{k=1}^{M}\frac{\lambda_k\mu}{1 - \lambda_k\mu} < 1$$

where the $\{\lambda_k\}$ are the eigenvalues of $R_u$. The above condition on $\mu$ also
guarantees convergence in the mean, i.e., $E\,w_i \to w^o$. Moreover, the mean
and mean-square behaviors of LMS are characterized by recursions (23.23)
and (23.32), namely,

$$
\begin{bmatrix} E\,\overline{w}_i\\ \mathcal{W}_i\end{bmatrix}
=
\begin{bmatrix} I - \mu\Lambda & 0\\ 0 & \mathcal{F}\end{bmatrix}
\begin{bmatrix} E\,\overline{w}_{i-1}\\ \mathcal{W}_{i-1}\end{bmatrix}
+ \mu^2\sigma_v^2\begin{bmatrix} 0\\ \mathcal{Y}\end{bmatrix}
$$

(or, more generally, by (23.48) further ahead for other choices of $\overline{\sigma}$ in the
mean-square case). The coefficients $\{p_k\}$ that define the matrix $\mathcal{F}$ are now
obtained from the characteristic polynomial of the matrix $\overline{F}$ in (23.45).


23.5 STEADY-STATE PERFORMANCE 


In the above derivation, we used the variance relation (23.20) to characterize the transient 
behavior of LMS in terms of the first-order recursion (23.32). We can use the same variance 
relation to shed further light on the steady-state performance of LMS. Of course, we al- 
ready studied the steady-state performance of LMS in Chapter 16, and derived expressions 
for its (excess) mean-square error under varied conditions on the step-size and the data. 
Here we shall use the transient analysis of this section to provide additional insights under 
the independence assumption (22.23). Clearly, any steady-state results that are obtained 
by examining the limiting behavior of a transient analysis would be bound by the same 
assumptions/restrictions that are needed to advance the transient analysis. In contrast, the 
steady-state analysis that we carried out in Chapter 16 was purposely decoupled from the 
transient analysis and, in this way, it relied on weaker assumptions than those needed in 
the current chapter. 
In this section, we shall re-examine the excess mean-square error,

$$\mathrm{EMSE} \triangleq \lim_{i\to\infty} E\,|e_a(i)|^2$$

as well as study the so-called mean-square deviation of the filter, which is defined as

$$\mathrm{MSD} \triangleq \lim_{i\to\infty} E\,\|\tilde{w}_i\|^2$$
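To begin with, let us first note that if model (23.32) is stable, then it will remain stable
if $q$ is replaced by any other choice for $\overline{\sigma}$. Indeed, it is straightforward to see from the
arguments that led to (23.32) that had we started with any other choice for $\overline{\sigma}$, a similar
state-space recursion would have resulted with the same coefficient matrix $\mathcal{F}$, namely

$$
\underbrace{\begin{bmatrix} E\|\overline{w}_i\|^2_{\overline{\sigma}}\\ E\|\overline{w}_i\|^2_{\overline{F}\,\overline{\sigma}}\\ E\|\overline{w}_i\|^2_{\overline{F}^2\overline{\sigma}}\\ \vdots\\ E\|\overline{w}_i\|^2_{\overline{F}^{M-2}\overline{\sigma}}\\ E\|\overline{w}_i\|^2_{\overline{F}^{M-1}\overline{\sigma}}\end{bmatrix}}_{\mathcal{W}_i}
=
\underbrace{\begin{bmatrix} 0 & 1 & & & \\ 0 & 0 & 1 & & \\ 0 & 0 & 0 & 1 & \\ \vdots & & & & \ddots\\ 0 & 0 & 0 & \cdots & 1\\ -p_0 & -p_1 & -p_2 & \cdots & -p_{M-1}\end{bmatrix}}_{\mathcal{F}}
\underbrace{\begin{bmatrix} E\|\overline{w}_{i-1}\|^2_{\overline{\sigma}}\\ E\|\overline{w}_{i-1}\|^2_{\overline{F}\,\overline{\sigma}}\\ E\|\overline{w}_{i-1}\|^2_{\overline{F}^2\overline{\sigma}}\\ \vdots\\ E\|\overline{w}_{i-1}\|^2_{\overline{F}^{M-2}\overline{\sigma}}\\ E\|\overline{w}_{i-1}\|^2_{\overline{F}^{M-1}\overline{\sigma}}\end{bmatrix}}_{\mathcal{W}_{i-1}}
+ \mu^2\sigma_v^2
\underbrace{\begin{bmatrix} \lambda^T\overline{\sigma}\\ \lambda^T\overline{F}\,\overline{\sigma}\\ \lambda^T\overline{F}^2\overline{\sigma}\\ \vdots\\ \lambda^T\overline{F}^{M-2}\overline{\sigma}\\ \lambda^T\overline{F}^{M-1}\overline{\sigma}\end{bmatrix}}_{\mathcal{Y}}
$$

with $q$ replaced by $\overline{\sigma}$. In other words, it would still hold that

$$\mathcal{W}_i = \mathcal{F}\,\mathcal{W}_{i-1} + \mu^2\sigma_v^2\,\mathcal{Y} \qquad (23.48)$$

where the entries of $\mathcal{W}_i$ and $\mathcal{Y}$ are now defined as above for arbitrary $\overline{\sigma}$, while the coefficient
matrix $\mathcal{F}$ remains the same as before. Therefore, no matter which $\overline{\sigma}$ we choose, stability
of (23.48) is guaranteed by the same condition on $\overline{F}$ as in (23.33), namely, its eigenvalues
should lie inside the unit disc.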

With this issue settled, we can now explain how the freedom in selecting $\overline{\sigma}$ can be useful.
To see this, consider the setting of Thm. 23.1 and assume the step-size $\mu$ has been chosen
to guarantee filter stability. Then recursion (23.20) becomes in steady-state

$$E\|\overline{w}_\infty\|^2_{\overline{\sigma}} = E\|\overline{w}_\infty\|^2_{\overline{F}\,\overline{\sigma}} + \mu^2\sigma_v^2(\lambda^T\overline{\sigma}) \qquad (23.49)$$

which is equivalent to

$$E\|\overline{w}_\infty\|^2_{(I - \overline{F})\overline{\sigma}} = \mu^2\sigma_v^2(\lambda^T\overline{\sigma}) \qquad (23.50)$$

This expression can be used to examine the mean-square performance of LMS in an interesting
way.

For example, in order to evaluate the MSD of LMS we would need to evaluate the
expectation $E\|\overline{w}_\infty\|^2_{q}$. Thus, assume that we select the weighting vector $\overline{\sigma}$ in (23.50) as
the solution to the linear system of equations $(I - \overline{F})\overline{\sigma}_{\mathrm{msd}} = q$, i.e., as $\overline{\sigma}_{\mathrm{msd}} = (I - \overline{F})^{-1}q$.
In this way, the weighting vector that appears in (23.50) will become $q$. Then the left-hand
side of (23.50) will coincide with the filter MSD and, therefore, we would be able to
conclude that

$$\mathrm{MSD} = \mu^2\sigma_v^2\,\lambda^T(I - \overline{F})^{-1}q \qquad (23.51)$$


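A more explicit expression for the MSD can be found by evaluating the product
$\lambda^T(I - \overline{F})^{-1}q$. Using expression (23.22) for $\overline{F}$ we have

$$I - \overline{F} = 2\mu\Lambda - \mu^2\Lambda^2 - \mu^2\lambda\lambda^T \triangleq D - \mu^2\lambda\lambda^T$$

where we introduced, for convenience, the matrix

$$D = 2\mu\Lambda - \mu^2\Lambda^2 \qquad (23.52)$$

Then, using the matrix inversion formula (5.4),

$$
\begin{aligned}
\lambda^T(I - \overline{F})^{-1}q &= \lambda^T\left(D - \mu^2\lambda\lambda^T\right)^{-1}q\\
&= \lambda^T\left[D^{-1} + \mu^2D^{-1}\lambda\left(1 - \mu^2\lambda^TD^{-1}\lambda\right)^{-1}\lambda^TD^{-1}\right]q\\
&= \left(1 + \frac{\mu^2\lambda^TD^{-1}\lambda}{1 - \mu^2\lambda^TD^{-1}\lambda}\right)\lambda^TD^{-1}q\\
&= \frac{\lambda^TD^{-1}q}{1 - \mu^2\lambda^TD^{-1}\lambda} && (23.53)
\end{aligned}
$$

Substituting (23.53) into (23.51), and noting that $D$ is diagonal with entries $\mu\lambda_k(2 - \mu\lambda_k)$, we get

$$\mathrm{MSD} = \frac{\mu\sigma_v^2\displaystyle\sum_{k=1}^{M}\frac{1}{2 - \mu\lambda_k}}{1 - \mu\displaystyle\sum_{k=1}^{M}\frac{\lambda_k}{2 - \mu\lambda_k}} \qquad (23.54)$$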


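In a similar vein, we can evaluate the EMSE of LMS. Thus, recall from (22.24) that $u_i$
and $\tilde{w}_{i-1}$ are independent random variables. Then it follows that

$$
\begin{aligned}
E\,|e_a(i)|^2 &= E\,\tilde{w}_{i-1}^*u_i^*u_i\tilde{w}_{i-1}\\
&= E\left[E\left(\tilde{w}_{i-1}^*u_i^*u_i\tilde{w}_{i-1} \mid \tilde{w}_{i-1}\right)\right]\\
&= E\left[\tilde{w}_{i-1}^*\,E\left(u_i^*u_i \mid \tilde{w}_{i-1}\right)\tilde{w}_{i-1}\right]\\
&= E\left(\tilde{w}_{i-1}^*R_u\tilde{w}_{i-1}\right)\\
&= E\,\|\tilde{w}_{i-1}\|^2_{R_u}\\
&= E\,\|\overline{w}_{i-1}\|^2_{\Lambda} && (23.55)
\end{aligned}
$$

In other words, in order to determine the EMSE we need to evaluate $E\|\overline{w}_\infty\|^2_{\lambda}$, with
weighting factor $\Lambda = \mathrm{diag}\{\lambda\}$. Therefore, assume that we now select $\overline{\sigma}$ in (23.50) as the
solution to the linear system of equations $(I - \overline{F})\overline{\sigma}_{\mathrm{emse}} = \lambda$, i.e., as $\overline{\sigma}_{\mathrm{emse}} = (I - \overline{F})^{-1}\lambda$. In
this way, the weighting quantity that appears in (23.50) will become $\lambda$. Then the left-hand
side of (23.50) will coincide with the filter EMSE and we would get

$$\mathrm{EMSE} = \mu^2\sigma_v^2\,\lambda^T(I - \overline{F})^{-1}\lambda \qquad (23.56)$$

Again a more explicit expression for the EMSE can be found by evaluating the product
$\lambda^T(I - \overline{F})^{-1}\lambda$ in much the same way as we did for the MSD above, leading to

$$\lambda^T(I - \overline{F})^{-1}\lambda = \frac{\lambda^TD^{-1}\lambda}{1 - \mu^2\lambda^TD^{-1}\lambda}$$

so that

$$\mathrm{EMSE} = \frac{\mu\sigma_v^2\displaystyle\sum_{k=1}^{M}\frac{\lambda_k}{2 - \mu\lambda_k}}{1 - \mu\displaystyle\sum_{k=1}^{M}\frac{\lambda_k}{2 - \mu\lambda_k}}$$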

The above derivations assume complex-valued regressors. If the regressors were instead
real-valued, then only slight modifications will occur. In particular, we would need to use
expression (23.45) for $\overline{F}$, in which case the matrix $D$ in (23.52) would be replaced by
$D = 2\mu\Lambda - 2\mu^2\Lambda^2$. Repeating the derivation for the MSD and EMSE leads to the result
summarized below.


Theorem 23.3 (MSD and EMSE of LMS) Consider the LMS recursion (23.1)
and assume the data $\{d(i), u_i\}$ satisfy model (22.1) and the independence
assumption (22.23). Then the MSD and EMSE are given by

$$
\mathrm{EMSE} = \frac{\mu\sigma_v^2\displaystyle\sum_{k=1}^{M}\frac{\lambda_k}{2 - s\mu\lambda_k}}{1 - \mu\displaystyle\sum_{k=1}^{M}\frac{\lambda_k}{2 - s\mu\lambda_k}},
\qquad
\mathrm{MSD} = \frac{\mu\sigma_v^2\displaystyle\sum_{k=1}^{M}\frac{1}{2 - s\mu\lambda_k}}{1 - \mu\displaystyle\sum_{k=1}^{M}\frac{\lambda_k}{2 - s\mu\lambda_k}}
$$

where $s = 1$ if the regressors are circular Gaussian and $s = 2$ if the regressors
are real-valued Gaussian. Moreover, the $\{\lambda_k\}$ are the eigenvalues of $R_u$.
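
The closed-form expressions of the theorem can be compared against the matrix forms (23.51) and (23.56). The sketch below is one way to do so, assuming illustrative values for the step-size, the noise variance, and the eigenvalues of $R_u$; the function name `emse_msd` is hypothetical.

```python
import numpy as np

def emse_msd(mu, lam, sigma_v2, s=1):
    """EMSE and MSD of Thm. 23.3 (s=1 circular Gaussian, s=2 real Gaussian)."""
    d = 2.0 - s * mu * lam
    denom = 1.0 - mu * np.sum(lam / d)
    emse = mu * sigma_v2 * np.sum(lam / d) / denom
    msd = mu * sigma_v2 * np.sum(1.0 / d) / denom
    return emse, msd

mu, sigma_v2 = 0.05, 0.01           # illustrative values
lam = np.array([0.5, 1.0, 2.0])     # illustrative eigenvalues of R_u
q = np.ones_like(lam)
I = np.eye(lam.size)

results = {}
for s in (1, 2):
    # Fbar from (23.22) when s=1 and from (23.45) when s=2.
    Fbar = np.diag(1.0 - 2.0 * mu * lam + s * mu**2 * lam**2) \
        + mu**2 * np.outer(lam, lam)
    msd_matrix = mu**2 * sigma_v2 * lam @ np.linalg.solve(I - Fbar, q)     # (23.51)
    emse_matrix = mu**2 * sigma_v2 * lam @ np.linalg.solve(I - Fbar, lam)  # (23.56)
    results[s] = (emse_msd(mu, lam, sigma_v2, s), (emse_matrix, msd_matrix))
    print(s, results[s])
```

Solving the linear system with `np.linalg.solve` avoids forming $(I - \overline{F})^{-1}$ explicitly; both routes should give the same numbers up to rounding.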


23.6 SMALL STEP-SIZE APPROXIMATIONS 


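If the step-size is small enough in the sense that $\mu\lambda_k \ll 1$, then the above expressions for
the EMSE and MSD simplify to

$$\mathrm{EMSE} \approx \frac{\mu\sigma_v^2\,\mathrm{Tr}(R_u)}{2 - \mu\,\mathrm{Tr}(R_u)} \quad \text{and} \quad \mathrm{MSD} \approx \frac{\mu\sigma_v^2M}{2 - \mu\,\mathrm{Tr}(R_u)} \qquad (23.57)$$

For even smaller step-sizes, we can further approximate the denominators by 2 so that

$$\mathrm{EMSE} \approx \mu\sigma_v^2\,\mathrm{Tr}(R_u)/2 \quad \text{and} \quad \mathrm{MSD} \approx \mu\sigma_v^2M/2 \qquad (23.58)$$

If, on the other hand, the covariance matrix $R_u$ is a scaled multiple of the identity, say,
$R_u = \sigma_u^2I$, then the expressions for EMSE and MSD from the theorem reduce to

$$\mathrm{EMSE} = \frac{\mu M\sigma_v^2\sigma_u^2}{2 - \mu(M + s)\sigma_u^2} \quad \text{and} \quad \mathrm{MSD} = \frac{\mu M\sigma_v^2}{2 - \mu(M + s)\sigma_u^2} \qquad (23.59)$$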


These expressions for the EMSE coincide with the ones derived earlier in Lemma 16.1. 
It is reassuring to see how the arguments of Chapter 16 and the arguments of the current 
chapter complement each other. Keep in mind though that the derivation here relies on 
the independence condition (22.23), while the derivation of Chapter 16 did not require 
(22.23). Observe further how the expressions for the filter EMSE and MSD were obtained
by simply choosing convenient values for the free parameter $\overline{\sigma}$.
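
To get a feel for how good the small step-size approximations are, the sketch below evaluates the exact EMSE of Thm. 23.3 next to (23.57) and (23.58), and checks the white-input reduction (23.59). All numerical values are illustrative assumptions.

```python
def emse_exact(mu, eigs, sigma_v2, s=1):
    """Exact EMSE from Thm. 23.3."""
    a = sum(l / (2.0 - s * mu * l) for l in eigs)
    return mu * sigma_v2 * a / (1.0 - mu * a)

# Small step-size comparison (illustrative numbers).
mu, sigma_v2 = 1e-3, 0.01
eigs = [0.5, 1.0, 2.0]
trace = sum(eigs)

exact = emse_exact(mu, eigs, sigma_v2)
approx_57 = mu * sigma_v2 * trace / (2.0 - mu * trace)   # (23.57)
approx_58 = mu * sigma_v2 * trace / 2.0                  # (23.58)
rel_err_57 = abs(exact - approx_57) / exact
rel_err_58 = abs(exact - approx_58) / exact

# White input, R_u = sigma_u^2 I: the theorem reduces exactly to (23.59).
M, sigma_u2, s, mu_w = 10, 1.0, 2, 0.01
white_exact = emse_exact(mu_w, [sigma_u2] * M, sigma_v2, s)
white_59 = mu_w * M * sigma_v2 * sigma_u2 / (2.0 - mu_w * (M + s) * sigma_u2)
print(rel_err_57, rel_err_58, white_exact, white_59)
```

The relative errors shrink as $\mu$ decreases, while the white-input expression is an identity rather than an approximation.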


23.A APPENDIX: CONVERGENCE TIME 


Besides stability and steady-state performance, the transient analysis developed so far also provides 
information about other aspects pertaining to the operation of adaptive filters. In this appendix we 
illustrate how the convergence time of an adaptive filter can be estimated from the results on its
transient performance. We consider the example of an LMS filter with white regression data. The case
of correlated data, as well as more general data-normalized filters, are treated in App. 9.D of Sayed
(2003).
LMS with White-Input Gaussian Data
Consider the LMS recursion

$$w_i = w_{i-1} + \mu u_i^*e(i), \qquad e(i) = d(i) - u_iw_{i-1}, \qquad w_{-1} = 0$$

and assume the data $\{d(i), u_i\}$ satisfy model (22.1) and the independence assumption (22.23).
Assume further that the individual entries of $u_i$ are zero-mean, Gaussian, and i.i.d. with variance $\sigma_u^2$
and fourth moment $\gamma\sigma_u^4$, where $\gamma = 2$ for complex-valued data and $\gamma = 3$ for real-valued data. It
then follows from Thm. 23.3 (see Prob. V.21) that the mean-square deviation of the filter evolves
with time according to the following difference equation:

$$E\,\|\tilde{w}_i\|^2 = \left[1 - 2\mu\sigma_u^2 + \mu^2\sigma_u^4(M + \gamma - 1)\right]E\,\|\tilde{w}_{i-1}\|^2 + \mu^2\sigma_v^2\sigma_u^2M \qquad (23.60)$$


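Now since, in this case,

$$E\,|e_a(i)|^2 = \sigma_u^2\,E\,\|\tilde{w}_{i-1}\|^2$$

and since also

$$E\,|e(i)|^2 = \sigma_v^2 + E\,|e_a(i)|^2$$

some simple algebra will show that the learning curve of the filter evolves according to the recursion:

$$E\,|e(i)|^2 = \left[1 - 2\mu\sigma_u^2 + \mu^2\sigma_u^4(M + \gamma - 1)\right]E\,|e(i-1)|^2 + \mu\sigma_u^2\sigma_v^2\left[2 - \mu(\gamma - 1)\sigma_u^2\right]$$

which we rewrite more compactly as

$$E\,|e(i)|^2 = \alpha\,E\,|e(i-1)|^2 + \beta \qquad (23.61)$$

with the constants $\{\alpha, \beta\}$ defined by

$$\alpha \triangleq 1 - 2\mu\sigma_u^2 + \mu^2\sigma_u^4(M + \gamma - 1), \qquad \beta \triangleq \mu\sigma_u^2\sigma_v^2\left[2 - \mu(\gamma - 1)\sigma_u^2\right]$$

Mean-square stability of the filter requires $|\alpha| < 1$. Actually, it is easy to see that $\alpha > 0$ in this
example since $\alpha$ can be written as the sum of two positive terms,

$$\alpha = (1 - \mu\sigma_u^2)^2 + \mu^2\sigma_u^4(M + \gamma - 2)$$

so that $\alpha$ must satisfy $0 < \alpha < 1$ for stability. It then follows from (23.61) that the steady-state
mean-square error (MSE) of the filter is given by

$$
\begin{aligned}
\mathrm{MSE} &\triangleq E\,|e(\infty)|^2\\
&= \frac{\beta}{1 - \alpha}\\
&= \frac{\sigma_v^2\left[2 - \mu(\gamma - 1)\sigma_u^2\right]}{2 - \mu\sigma_u^2(M + \gamma - 1)} && (23.62)
\end{aligned}
$$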


To proceed, we shall characterize the convergence time of an adaptive filter as follows.


Definition 23.1 (Convergence time) The convergence time of an adaptive filter is
the number of iterations, $K$, that is needed for its mean-square error to reach $(1 + \epsilon)$
times its steady-state value, for some given $\epsilon > 0$. That is, it is the time $K$ at which

$$E\,|e(K)|^2 = (1 + \epsilon)\,E\,|e(\infty)|^2$$

For example, choosing $\epsilon = 0.1$ amounts to requiring the mean-square error of the filter to be within
10% of its steady-state value. To apply the definition to LMS, we first rewrite expression (23.61)
more conveniently as

$$\left(E\,|e(i)|^2 - \frac{\beta}{1 - \alpha}\right) = \alpha\left(E\,|e(i-1)|^2 - \frac{\beta}{1 - \alpha}\right)$$

i.e., we center the mean-square error around its steady-state value. Iterating we obtain

$$\left(E\,|e(i)|^2 - \mathrm{MSE}\right) = \alpha^i\left(E\,|e(0)|^2 - \mathrm{MSE}\right) \qquad (23.63)$$


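Using

$$e(0) = e_a(0) + v(0) = u_0w^o + v(0)$$

(since $\tilde{w}_{-1} = w^o$), we can express $E\,|e(0)|^2$ in terms of the SNR at the output of the model filter
$w^o$. Specifically, we have

$$E\,|e(0)|^2 = \sigma_v^2 + \sigma_u^2\|w^o\|^2 = \sigma_v^2\left(1 + \frac{\sigma_u^2\|w^o\|^2}{\sigma_v^2}\right) = \sigma_v^2(1 + \mathrm{SNR})$$

where the ratio

$$\mathrm{SNR} = \sigma_u^2\|w^o\|^2/\sigma_v^2$$

denotes the output SNR. Substituting into (23.63) gives

$$\left(E\,|e(i)|^2 - \mathrm{MSE}\right) = \alpha^i\left(\sigma_v^2(1 + \mathrm{SNR}) - \mathrm{MSE}\right)$$

Setting

$$i = K, \qquad E\,|e(K)|^2 = (1 + \epsilon)\,\mathrm{MSE}$$

and solving for $K$, we arrive at the expression

$$K\ln(\alpha) = \ln\left[\frac{\epsilon\,\mathrm{MSE}}{\sigma_v^2(1 + \mathrm{SNR}) - \mathrm{MSE}}\right]$$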
We can rework this result into an equivalent form in terms of the filter misadjustment as follows.
Recall that

$$\mathrm{MSE} = \sigma_v^2 + \mathrm{EMSE}$$

with EMSE denoting the filter excess mean-square error (i.e., $\mathrm{EMSE} = E\,|e_a(\infty)|^2$). Recall further
that the filter misadjustment is $\mathcal{M} = \mathrm{EMSE}/\sigma_v^2$. Substituting into the expression for $K$ we get

$$K = \frac{\ln\left[\dfrac{\epsilon(1 + \mathcal{M})}{\mathrm{SNR} - \mathcal{M}}\right]}{\ln(\alpha)} \quad \text{(iterations)} \qquad (23.64)$$

Moreover, from (23.62),

$$\mathrm{EMSE} = \mathrm{MSE} - \sigma_v^2 = \frac{\mu\sigma_u^2\sigma_v^2M}{2 - \mu\sigma_u^2(M + \gamma - 1)}$$

so that

$$\mathcal{M} = \frac{\mu\sigma_u^2M}{2 - \mu\sigma_u^2(M + \gamma - 1)}$$

The resulting expression (23.64) for the convergence time $K$ is seen to be dependent on the step-size,
the filter length, and the output SNR.
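
Expression (23.64) is straightforward to evaluate. The following is a minimal sketch, assuming real-valued white Gaussian regressors ($\gamma = 3$) and hypothetical parameter values; the function name `convergence_time` and its argument names are illustrative only. It returns the (generally non-integer) value of $K$ together with the misadjustment and $\alpha$.

```python
import math

def convergence_time(mu, M, sigma_u2, snr, gamma=3.0, eps=0.1):
    """Evaluate K from (23.64) for LMS with white Gaussian regressors.

    snr is the linear (not dB) output SNR; gamma=3 for real data, 2 for complex.
    """
    alpha = 1.0 - 2.0 * mu * sigma_u2 + mu**2 * sigma_u2**2 * (M + gamma - 1.0)
    denom = 2.0 - mu * sigma_u2 * (M + gamma - 1.0)
    if denom <= 0.0 or not (0.0 < alpha < 1.0):
        raise ValueError("step-size outside the mean-square stability range")
    misadj = mu * sigma_u2 * M / denom
    if snr <= misadj:
        raise ValueError("initial error is already within the target level")
    K = math.log(eps * (1.0 + misadj) / (snr - misadj)) / math.log(alpha)
    return K, misadj, alpha

# Hypothetical setup: 10 taps, unit-variance input, 30 dB output SNR.
snr = 10.0 ** (30.0 / 10.0)
K_small, misadj_small, _ = convergence_time(0.001, 10, 1.0, snr)
K_large, misadj_large, alpha_large = convergence_time(0.01, 10, 1.0, snr)
print(K_small, misadj_small)
print(K_large, misadj_large)
```

The two calls illustrate the trade-off discussed in the text: the larger step-size converges in fewer iterations but yields a larger misadjustment.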


Figure 23.2 plots the theoretical and simulated values of the convergence time $K$ (for $\epsilon = 0.1$) and
the theoretical values for the filter misadjustment $\mathcal{M}$, against the step-size $\mu$, for an LMS
implementation with 10 taps, unit-variance Gaussian input, and output SNR at 30 dB. The values obtained for
$K$ are listed in Table 23.1. The simulated values were obtained by averaging over 100 experiments
and by running each experiment over 500000 iterations. It is seen from the top plot in Fig. 23.2 that
the convergence time decreases as $\mu$ increases, which is an expected behavior. The plot uses
logarithmic [...] smaller values for better misadjustment vs. larger values for shorter convergence time.


TABLE 23.1 Simulated and experimental convergence time (in terms of number of 
iterations) for an LMS implementation with 10 taps and using white Gaussian input. 


479501 | 92501 | 46501 | 9501 | 4501 


FIGURE 23.2 Plots of the convergence time (top) and misadjustment (bottom) for a 10-tap
LMS implementation with output SNR set at 20 dB. Plot title: "Convergence time and misadjustment
of a 10-tap LMS implementation with white Gaussian input and SNR = 30 dB".


LMS with non-Gaussian Regressors


We now drop the Gaussian assumption on the regressors and show how the variance
relation of Thm. 22.3 can be used to study the performance of LMS in this case as well.
Although we are using the words non-Gaussian regressors in the title of this chapter, the
results herein include the Gaussian case as a special case.


24.1 MEAN AND VARIANCE RELATIONS 


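Thus, refer again to the transformed recursion (23.7), which characterizes the transient
performance of LMS. When the regressors $u_i$ were Gaussian, we were able to evaluate the
three moments below (see (23.9)-(23.11)):

$$E\,\overline{u}_i^*\overline{u}_i, \qquad E\,\|\overline{u}_i\|^2_{\overline{\Sigma}}, \qquad \text{and} \qquad E\,\|\overline{u}_i\|^2_{\overline{\Sigma}}\,\overline{u}_i^*\overline{u}_i \qquad (24.1)$$

In particular, we found that $E\,\overline{u}_i^*\overline{u}_i$ and $E\,\|\overline{u}_i\|^2_{\overline{\Sigma}}\,\overline{u}_i^*\overline{u}_i$ were simultaneously diagonal and
that the weighting matrices $\{\overline{\Sigma}, \overline{\Sigma}'\}$ themselves could be made diagonal as well; see
(23.13).

However, when the regressors $u_i$ are non-Gaussian, it is generally not possible to express
the last moment in (24.1) in closed-form any longer (as we did in (23.11); see
though Prob. V.11). In addition, and more importantly perhaps, the moments $E\,\overline{u}_i^*\overline{u}_i$ and
$E\,\|\overline{u}_i\|^2_{\overline{\Sigma}}\,\overline{u}_i^*\overline{u}_i$ need not be simultaneously diagonal anymore. In this way, the weighting
matrix $\overline{\Sigma}'$ need not be diagonal even if $\overline{\Sigma}$ is.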

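Nevertheless, the transient analysis of LMS can still be pursued in much the same way as
we did in the Gaussian case if we replace the $\mathrm{diag}\{\cdot\}$ notation in (23.15) by an alternative
$\mathrm{vec}\{\cdot\}$ notation. Before doing so, we remark that since the weighting matrices $\{\overline{\Sigma}, \overline{\Sigma}'\}$
are not necessarily diagonal anymore, we shall pursue our analysis by working with the
original variance and mean relations (23.3), instead of the transformed variance and mean
relations (23.7), namely, we shall now work with

$$
\begin{aligned}
E\,\|\tilde{w}_i\|^2_{\Sigma} &= E\,\|\tilde{w}_{i-1}\|^2_{\Sigma'} + \mu^2\sigma_v^2\,E\,\|u_i\|^2_{\Sigma}\\
\Sigma' &= \Sigma - \mu\Sigma\,E[u_i^*u_i] - \mu\,E[u_i^*u_i]\,\Sigma + \mu^2E\left[\|u_i\|^2_{\Sigma}\,u_i^*u_i\right] && (24.2)\\
E\,\tilde{w}_i &= (I - \mu R_u)\,E\,\tilde{w}_{i-1}
\end{aligned}
$$

Vector Notation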
Vector Notation 


The $\mathrm{diag}\{\cdot\}$ notation allowed us to replace an $M \times M$ matrix by an $M \times 1$ column vector
whose entries were the diagonal entries of the matrix. More generally, we shall use the
$\mathrm{vec}\{\cdot\}$ notation to replace an $M \times M$ arbitrary matrix by an $M^2 \times 1$ column vector whose
entries are formed by stacking the successive columns of the matrix on top of each other.
Thus, we shall write

$$\sigma = \mathrm{vec}(\Sigma) \qquad (24.3)$$

with $\sigma$ denoting the vectorized version of $\Sigma$. Likewise, we shall write $r$ to denote the
vectorized version of $R_u$,

$$r = \mathrm{vec}(R_u) \qquad (r \text{ is } M^2 \times 1) \qquad (24.4)$$

and $r'$ to denote the vectorized version of $R_u^T$,

$$r' = \mathrm{vec}(R_u^T) \qquad (24.5)$$


When the regressors are real-valued, so that $R_u = R_u^T$, the vectors $\{r, r'\}$ will coincide.
However, when the regressors are complex-valued, we need to distinguish between $r$ and
$r'$.

In addition, we shall use the notation $\mathrm{vec}^{-1}(\cdot)$ to recover a matrix from its vec representation.
Thus, writing $\mathrm{vec}^{-1}(a)$ for an $M^2 \times 1$ column vector $a$ results in an $M \times M$
matrix whose entries are obtained by unstacking the elements of $a$. This choice of notation
is in contrast to the $\mathrm{diag}\{\cdot\}$ operation, which is generally accepted as a two-directional
operation: it maps diagonal matrices to vectors and vectors into diagonal matrices. Therefore,
we shall write

$$\Sigma = \mathrm{vec}^{-1}\{\sigma\} \quad \text{and} \quad R_u = \mathrm{vec}^{-1}\{r\} \qquad (24.6)$$

to recover $\{\Sigma, R_u\}$ from $\{\sigma, r\}$.


Kronecker Products 

The $\mathrm{vec}\{\cdot\}$ notation is most convenient when working with Kronecker products (see Sec. B.7).
The Kronecker product of two matrices $A$ and $B$, say, of dimensions $m_a \times n_a$ and $m_b \times n_b$,
respectively, is denoted by $A \otimes B$ and is defined as the $m_am_b \times n_an_b$ matrix

$$
A \otimes B = \begin{bmatrix}
a_{1,1}B & a_{1,2}B & \cdots & a_{1,n_a}B\\
a_{2,1}B & a_{2,2}B & \cdots & a_{2,n_a}B\\
\vdots & \vdots & & \vdots\\
a_{m_a,1}B & a_{m_a,2}B & \cdots & a_{m_a,n_a}B
\end{bmatrix}
$$

This operation has several useful properties (see Lemma B.8), but the one that most interests
us here is the following. For any matrices $\{P, X, Q\}$ of compatible dimensions, it
holds that

$$\mathrm{vec}(PXQ) = (Q^T \otimes P)\,\mathrm{vec}(X) \qquad (24.7)$$

This property tells us how the vec of the product of three matrices is related to the vec of
the center matrix.
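
Property (24.7) is easy to confirm numerically. The sketch below uses NumPy with randomly generated complex matrices (arbitrary test data, not from the text); the helper names `vec` and `unvec` are hypothetical. Note that the column-stacking convention corresponds to Fortran (column-major) ordering, and that $Q^T$ is the plain transpose, without conjugation.

```python
import numpy as np

def vec(X):
    """Stack the columns of X on top of each other (column-major order)."""
    return X.reshape(-1, order="F")

def unvec(x, rows, cols):
    """Inverse of vec: rebuild a rows x cols matrix from a stacked vector."""
    return x.reshape(rows, cols, order="F")

rng = np.random.default_rng(0)
P = rng.standard_normal((3, 4)) + 1j * rng.standard_normal((3, 4))
X = rng.standard_normal((4, 2)) + 1j * rng.standard_normal((4, 2))
Q = rng.standard_normal((2, 5)) + 1j * rng.standard_normal((2, 5))

lhs = vec(P @ X @ Q)
rhs = np.kron(Q.T, P) @ vec(X)     # plain transpose of Q, as in (24.7)
print(np.max(np.abs(lhs - rhs)))
```

Using the default row-major `reshape` instead of `order="F"` would silently break the identity, which is a common pitfall when porting vec-based formulas to NumPy.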


Linear Vector Relation 


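With the above notations, we can now verify that expression (24.2) for $\Sigma'$ in terms of $\Sigma$ still
amounts to a linear relation between the corresponding vectors $\{\sigma, \sigma'\}$, just like it was the
case for $\{\overline{\sigma}, \overline{\sigma}'\}$ in (23.17). Indeed, applying (24.7) to some of the terms in the expression
for $\Sigma'$ in (24.2) we find that

$$
\begin{aligned}
\mathrm{vec}\left(\Sigma\,E[u_i^*u_i]\right) &= \left([E\,u_i^*u_i]^T \otimes I_M\right)\sigma = (R_u^T \otimes I_M)\,\sigma\\
\mathrm{vec}\left(E[u_i^*u_i]\,\Sigma\right) &= \left(I_M \otimes [E\,u_i^*u_i]\right)\sigma = (I_M \otimes R_u)\,\sigma\\
\mathrm{vec}\left(E\,\|u_i\|^2_{\Sigma}\,u_i^*u_i\right) &= \mathrm{vec}\left(E[u_i^*u_i\,\Sigma\,u_i^*u_i]\right) = E\left([u_i^*u_i]^T \otimes [u_i^*u_i]\right)\sigma
\end{aligned}
$$

Taking the vec of both sides of (24.2), and using the above equalities, we find that the
weighting vectors $\{\sigma, \sigma'\}$ satisfy a relation similar to (23.21), albeit one that is $M^2$-dimensional,

$$\sigma' = F\sigma \qquad (24.8)$$

where $F$ is $M^2 \times M^2$ and given by

$$F \triangleq I_{M^2} - \mu(I_M \otimes R_u) - \mu(R_u^T \otimes I_M) + \mu^2E\left([u_i^*u_i]^T \otimes [u_i^*u_i]\right) \qquad (24.9)$$

or, more compactly, in factored form:

$$F \triangleq E\left[(I_M - \mu u_i^*u_i)^T \otimes (I_M - \mu u_i^*u_i)\right] \qquad (24.10)$$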
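
The equivalence of (24.9) and (24.10), and the linear relation (24.8), can be illustrated with sample averages in place of the expectations. The sketch below is only an illustration under stated assumptions: the regressors are drawn as real-valued uniform (hence non-Gaussian) row vectors, and the weighting matrix `Sigma` is an arbitrary hypothetical choice. Because the same samples are used throughout, the identities hold exactly for the sample moments.

```python
import numpy as np

rng = np.random.default_rng(1)
M, N, mu = 2, 2000, 0.1
U = rng.uniform(-1.0, 1.0, size=(N, M))      # rows u_i: real, non-Gaussian
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])   # hypothetical weighting matrix
I_M = np.eye(M)

R = np.zeros((M, M))                 # sample estimate of E u_i^* u_i
K4 = np.zeros((M * M, M * M))        # sample E([u^*u]^T kron [u^*u])
T4 = np.zeros((M, M))                # sample E(||u||_Sigma^2 u^*u)
F_factored = np.zeros((M * M, M * M))
for u in U:
    uu = np.outer(u.conj(), u)       # u_i^* u_i, an M x M matrix
    R += uu / N
    K4 += np.kron(uu.T, uu) / N
    T4 += (u @ Sigma @ u.conj()) * uu / N
    X = I_M - mu * uu
    F_factored += np.kron(X.T, X) / N          # factored form (24.10)

F_expanded = (np.eye(M * M) - mu * np.kron(I_M, R)
              - mu * np.kron(R.T, I_M) + mu**2 * K4)   # expanded form (24.9)

def vec(A):
    return A.reshape(-1, order="F")

# Sigma' from (24.2), then compare vec(Sigma') with F vec(Sigma) as in (24.8).
Sigma_prime = Sigma - mu * Sigma @ R - mu * R @ Sigma + mu**2 * T4
print(np.max(np.abs(vec(Sigma_prime) - F_expanded @ vec(Sigma))))
```

For a real application the moments would come from the assumed distribution of the regressors; the sample averages here only serve to exercise the algebra.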


Variance Relation 


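We can further rewrite recursion (24.2) for $E\,\|\tilde{w}_i\|^2_{\Sigma}$ by using the weighting vectors $\{\sigma, \sigma'\}$
instead of the matrices $\{\Sigma, \Sigma'\}$. Using (24.8) and the notation (24.6), recursion (24.2)
becomes

$$E\,\|\tilde{w}_i\|^2_{\mathrm{vec}^{-1}(\sigma)} = E\,\|\tilde{w}_{i-1}\|^2_{\mathrm{vec}^{-1}(F\sigma)} + \mu^2\sigma_v^2\,(r'^T\sigma)$$

where, for the last term, we used the fact that $E\,\|u_i\|^2_{\Sigma} = \mathrm{Tr}(R_u\Sigma) = r'^T\sigma$. For compactness
of notation, we shall drop the $\mathrm{vec}^{-1}(\cdot)$ notation from the subscripts and keep the
vectors, so that the above can be rewritten more compactly as

$$E\,\|\tilde{w}_i\|^2_{\sigma} = E\,\|\tilde{w}_{i-1}\|^2_{F\sigma} + \mu^2\sigma_v^2\,(r'^T\sigma) \qquad (24.11)$$

The vector weighting factors $\{\sigma, F\sigma\}$ in this expression should be understood as compact
representations for the actual weighting matrices $\{\mathrm{vec}^{-1}(\sigma), \mathrm{vec}^{-1}(F\sigma)\}$. In other
words, if $\sigma$ is any column vector, the compact notation $\|x\|^2_{\sigma}$ denotes

$$\|x\|^2_{\sigma} \triangleq \|x\|^2_{\mathrm{vec}^{-1}\{\sigma\}} = x^*\Sigma x, \quad \text{where } \Sigma = \mathrm{vec}^{-1}(\sigma)$$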


In summary, starting from (24.2), we argued that the weighting matrices $\{\Sigma, \Sigma'\}$ can
be vectorized, so that (24.2) can be equivalently expressed more compactly as in (24.8)-(24.10)
and (24.11), namely,

$$
\begin{aligned}
E\,\|\tilde{w}_i\|^2_{\sigma} &= E\,\|\tilde{w}_{i-1}\|^2_{\sigma'} + \mu^2\sigma_v^2\,(r'^T\sigma) && (24.12)\\
\sigma' &= F\sigma && (24.13)\\
F &= I_{M^2} - \mu(I_M \otimes R_u) - \mu(R_u^T \otimes I_M) + \mu^2E\left([u_i^*u_i]^T \otimes [u_i^*u_i]\right) && (24.14)\\
E\,\tilde{w}_i &= (I - \mu R_u)\,E\,\tilde{w}_{i-1} && (24.15)
\end{aligned}
$$


24.2 MEAN-SQUARE STABILITY AND PERFORMANCE


Although the coefficient matrix in the recursion for $E\,\tilde{w}_i$ is $(I - \mu R_u)$, this recursion is
equivalent under the unitary transformation (22.31) to (23.23), so that the same condition
on $\mu$ from (23.24) guarantees convergence in the mean, i.e.,

$$\mu < 2/\lambda_{\max} \qquad (24.16)$$

guarantees $E\,w_i \to w^o$.

In addition, the mean-square behavior of LMS in the non-Gaussian case is characterized
by Eqs. (24.12)-(24.15) in a manner that is similar to equations (23.20)-(23.23) in the
Gaussian case. The main difference is the dimension of the variables involved. In the
non-Gaussian case, the vector $\sigma$ is $M^2 \times 1$ and the matrix $F$ is $M^2 \times M^2$. Apart from this
difference, all other arguments that were employed before regarding mean-square stability,
excess mean-square error, and mean-square deviation extend almost literally to the present
case. In so doing, we will be able to conclude that the mean-square behavior of LMS is
now characterized by the following $M^2$-dimensional state-space recursion (see Prob. V.7):

$$\mathcal{W}_i = \mathcal{F}\,\mathcal{W}_{i-1} + \mu^2\sigma_v^2\,\mathcal{Y} \qquad (24.17)$$


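where $\mathcal{F}$ is the companion matrix

$$
\mathcal{F} = \begin{bmatrix}
0 & 1 & & & \\
0 & 0 & 1 & & \\
0 & 0 & 0 & 1 & \\
\vdots & & & & \ddots\\
0 & 0 & 0 & \cdots & 1\\
-p_0 & -p_1 & -p_2 & \cdots & -p_{M^2-1}
\end{bmatrix} \qquad (M^2 \times M^2) \qquad (24.18)
$$

with

$$p(x) \triangleq \det(xI - F) = x^{M^2} + \sum_{k=0}^{M^2-1}p_kx^k$$

denoting the characteristic polynomial of $F$ in (24.14). Also, $\mathcal{W}_i$ and $\mathcal{Y}$ are now the $M^2 \times 1$
vectors

$$
\mathcal{W}_i \triangleq \begin{bmatrix} E\,\|\tilde{w}_i\|^2_{\sigma}\\ E\,\|\tilde{w}_i\|^2_{F\sigma}\\ E\,\|\tilde{w}_i\|^2_{F^2\sigma}\\ \vdots\\ E\,\|\tilde{w}_i\|^2_{F^{M^2-1}\sigma}\end{bmatrix},
\qquad
\mathcal{Y} \triangleq \begin{bmatrix} r'^T\sigma\\ r'^TF\sigma\\ r'^TF^2\sigma\\ \vdots\\ r'^TF^{M^2-1}\sigma\end{bmatrix} \qquad (24.19)
$$

for any $\sigma$ of interest, e.g., most commonly, $\sigma = q$ or $\sigma = r$.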
We still need to specify the condition on $\mu$ for mean-square stability. To do so, we start
by expressing the matrix $F$ in (24.14) in the form

$$F = I_{M^2} - \mu A + \mu^2 B \tag{24.20}$$

with Hermitian matrices $\{A, B\}$ given by


$$A = (I_M \otimes R_u) + (R_u^{\mathsf T} \otimes I_M), \qquad B = E\left([u_i^*u_i]^{\mathsf T} \otimes [u_i^*u_i]\right) \tag{24.21}$$


Actually, $A$ is positive-definite and $B$ is nonnegative-definite. We shall assume that the
distribution of the regression sequence is such that $B$ is a finite matrix. For mean-square
stability we need to find conditions on $\mu$ in order to guarantee $-1 < \lambda(F) < 1$. However,
contrary to the Gaussian case in (23.34), the matrix $F$ is no longer guaranteed to be
nonnegative-definite in general (see Prob. V.4 for an example). In this way, the result
of Prob. V.3, which was used in the Gaussian case, is not immediately applicable since
it assumed $F \geq 0$. The condition given in that problem, namely, $\mu < 1/\lambda_{\max}(A^{-1}B)$,
guarantees $\lambda(F) < 1$. A second condition on $\mu$ is needed to enforce $-1 < \lambda(F)$. It is
shown in App. 25.A that the matrix $F$ will be stable for values of $\mu$ in the range:


$$0 < \mu < \min\left\{ \frac{1}{\lambda_{\max}(A^{-1}B)},\ \frac{1}{\max\{\lambda(H) \in \mathbb{R}^+\}} \right\} \tag{24.22}$$


where the second condition is in terms of the largest positive real eigenvalue of the following
block matrix,


$$H \triangleq \begin{bmatrix} A/2 & -B/2 \\ I_{M^2} & 0 \end{bmatrix} \quad (2M^2 \times 2M^2) \tag{24.23}$$


when it exists. If $H$ does not have any real positive eigenvalue, then the corresponding
condition is removed from (24.22) and we only require $\mu < 1/\lambda_{\max}(A^{-1}B)$. Conditions
(24.16) and (24.22) can be grouped together into a single condition as follows:


$$0 < \mu < \min\left\{ \frac{2}{\lambda_{\max}(R_u)},\ \frac{1}{\lambda_{\max}(A^{-1}B)},\ \frac{1}{\max\{\lambda(H) \in \mathbb{R}^+\}} \right\} \tag{24.24}$$


Theorem 24.1 (Stability of LMS for non-Gaussian regressors) Assume the
data $\{d(i), u_i\}$ satisfy model (22.1) and the independence assumption (22.23).
The regressors need not be Gaussian. Then the LMS filter (23.1) is convergent
in the mean and is mean-square stable if the step-size $\mu$ is chosen to
satisfy (24.24), where the matrices $A$ and $B$ are defined by (24.21) and $B$
is finite. Moreover, the transient behavior of LMS is characterized by the
$M^2$-dimensional state-space recursion (24.17)-(24.19), and the mean-square
deviation and excess mean-square error are given by


$$\mathrm{MSD} = \mu^2\sigma_v^2\, {r'}^{\mathsf T}(I - F)^{-1} q, \qquad \mathrm{EMSE} = \mu^2\sigma_v^2\, {r'}^{\mathsf T}(I - F)^{-1} r$$


where $r = \mathrm{vec}(R_u)$, $r' = \mathrm{vec}(R_u^{\mathsf T})$, and $q = \mathrm{vec}(I)$.


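The expressions for the MSD and EMSE in the statement of the theorem are derived in
a manner similar to (23.51) and (23.56). They can be rewritten as


$$\mathrm{MSD} = \mu^2\sigma_v^2\, E\,\|u_i\|^2_{(I-F)^{-1}q} \quad \text{and} \quad \mathrm{EMSE} = \mu^2\sigma_v^2\, E\,\|u_i\|^2_{(I-F)^{-1}r}$$

or, equivalently, as


$$\mathrm{MSD} = \mu^2\sigma_v^2\, \mathrm{Tr}(R_u\Sigma_{\rm msd}) \quad \text{and} \quad \mathrm{EMSE} = \mu^2\sigma_v^2\, \mathrm{Tr}(R_u\Sigma_{\rm emse})$$


where $\{\Sigma_{\rm msd}, \Sigma_{\rm emse}\}$ are the weighting matrices that correspond to the vectors
$\sigma_{\rm msd} = (I - F)^{-1}q$ and $\sigma_{\rm emse} = (I - F)^{-1}r$. That is,


$$\Sigma_{\rm msd} = \mathrm{vec}^{-1}(\sigma_{\rm msd}) \quad \text{and} \quad \Sigma_{\rm emse} = \mathrm{vec}^{-1}(\sigma_{\rm emse})$$

As a side remark, it is immediate to verify, using $(I - F) = \mu A - \mu^2 B$ and the expressions
for $\{A, B\}$ from (24.21), that the $\{\Sigma_{\rm msd}, \Sigma_{\rm emse}\}$ so defined satisfy the equations


$$\mu\,(R_u\Sigma_{\rm msd} + \Sigma_{\rm msd}R_u) - \mu^2\, E\left(u_i^*u_i\,\Sigma_{\rm msd}\,u_i^*u_i\right) = I$$

$$\mu\,(R_u\Sigma_{\rm emse} + \Sigma_{\rm emse}R_u) - \mu^2\, E\left(u_i^*u_i\,\Sigma_{\rm emse}\,u_i^*u_i\right) = R_u$$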
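To make the stability condition and the performance expressions concrete, the following sketch evaluates them for one case in which the fourth-order moment matrix $B$ can be computed exactly by enumeration. The choice of regressors with independent, equally likely $\pm 1$ entries is an assumption made only for illustration, and the function names are hypothetical. For this choice $R_u = I$ and $\|u_i\|^2 = M$, so the theorem's expressions should reduce to $\mathrm{MSD} = \mathrm{EMSE} = \mu\sigma_v^2 M/(2 - \mu M)$.

```python
import itertools
import numpy as np

def lms_moment_matrices(M):
    # Assumption for illustration: regressors are row vectors whose entries
    # are independent and equally likely +/-1, so all 2^M patterns can be
    # enumerated and the moments computed exactly.
    patterns = [np.array(p) for p in itertools.product([-1.0, 1.0], repeat=M)]
    outers = [np.outer(u, u) for u in patterns]          # u_i^* u_i (real data)
    I = np.eye(M)
    R = sum(outers) / len(outers)                        # R_u = E u_i^* u_i
    A = np.kron(I, R) + np.kron(R.T, I)                  # A in (24.21)
    B = sum(np.kron(C.T, C) for C in outers) / len(outers)   # B in (24.21)
    return R, A, B

def step_size_bound(R, A, B, tol=1e-8):
    # Upper limit on the step-size according to (24.24).
    n = A.shape[0]
    bounds = [2.0 / np.linalg.eigvalsh(R).max()]
    lam_ab = np.linalg.eigvals(np.linalg.solve(A, B)).real.max()
    if lam_ab > tol:
        bounds.append(1.0 / lam_ab)
    H = np.block([[A / 2, -B / 2], [np.eye(n), np.zeros((n, n))]])
    real_pos = [z.real for z in np.linalg.eigvals(H)
                if abs(z.imag) < tol and z.real > tol]
    if real_pos:  # this condition is dropped when H has no positive real eigenvalue
        bounds.append(1.0 / max(real_pos))
    return min(bounds)

def msd_emse(R, A, B, mu, sigma_v2):
    # Expressions from the theorem, with column-major vec.
    n = A.shape[0]
    F = np.eye(n) - mu * A + mu**2 * B                   # (24.20)
    r = R.reshape(-1, order="F")
    r_prime = R.T.reshape(-1, order="F")
    q = np.eye(R.shape[0]).reshape(-1, order="F")
    msd = mu**2 * sigma_v2 * r_prime @ np.linalg.solve(np.eye(n) - F, q)
    emse = mu**2 * sigma_v2 * r_prime @ np.linalg.solve(np.eye(n) - F, r)
    return F, msd, emse
```

With $M = 3$ the bound evaluates to $2/3$, which agrees with a direct eigenvalue analysis of $F$ for this particular distribution. Other distributions would require their own evaluation (or estimation) of $B$.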


24.3 SMALL STEP-SIZE APPROXIMATIONS 


While the matrix $A$ in (24.21) is readily available, the difficulty in the non-Gaussian case
lies in evaluating the matrix $B$ (which involves fourth-order moments). Only in special
cases may we be able to evaluate $B$ in closed form (see, e.g., Prob. V.15). In general, for
arbitrary distributions of $u_i$, it may not be possible to evaluate $B$. Still, it can be proved
that as long as $B$ is a finite matrix (i.e., as long as the distribution of the regressors is
such that the corresponding matrix $B$ of fourth-order moments is finite), then there always
exists a small enough $\mu$ that satisfies condition (24.24). In other words, under the data
conditions (22.1) and (22.23), the LMS filter can be guaranteed to be mean-square stable
for sufficiently small step-sizes, regardless of the distribution of the regressors. This fact is
established in Prob. V.38.

Moreover, observe that the expressions for the EMSE and MSD that are given in the
theorem need not be easy to evaluate in general since they are defined in terms of $F$, which
requires knowledge of $B$. However, since the fourth-order moment $E\,\|u_i\|^2_{\Sigma}\,u_i^*u_i$, which
is part of expression (24.2) for $\Sigma'$, appears multiplied by $\mu^2$ in (24.2), its effect can be
ignored if the step-size is sufficiently small. In this way, the variance relation (24.2) would
reduce to


$$\begin{aligned} E\,\|\tilde{w}_i\|^2_{\Sigma} &= E\,\|\tilde{w}_{i-1}\|^2_{\Sigma'} + \mu^2\sigma_v^2\, E\,\|u_i\|^2_{\Sigma} \\ \Sigma' &= \Sigma - \mu\,\Sigma\, E\,[u_i^*u_i] - \mu\, E\,[u_i^*u_i]\,\Sigma \end{aligned} \tag{24.25}$$


This relation can then be used to derive simplified expressions for the EMSE and MSD of
LMS for Gaussian and non-Gaussian regressors, as done in Prob. V.14 where it is shown
that


$$\mathrm{EMSE} = \mu\sigma_v^2\,\mathrm{Tr}(R_u)/2 \quad \text{and} \quad \mathrm{MSD} = \mu\sigma_v^2 M/2 \quad (\text{small } \mu)$$


This expression for the EMSE is the same one we derived earlier in Lemma 16.1.


Lemma 24.1 (Performance of LMS) Consider the LMS recursion (23.1) and
assume the data $\{d(i), u_i\}$ satisfy model (22.1) and the independence
assumption (22.23). Assume further that the matrix $B$ from (24.21) is bounded.
Then there always exists a small enough $\mu$ such that the LMS filter is
mean-square stable. Moreover, it holds that for such sufficiently small step-sizes:


$$\mathrm{EMSE} \approx \mu\sigma_v^2\,\mathrm{Tr}(R_u)/2 \quad \text{and} \quad \mathrm{MSD} \approx \mu\sigma_v^2 M/2$$
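The small step-size expression for the EMSE can be compared against a simulation. The sketch below is only an illustration under stated assumptions: real-valued white Gaussian regressors with $R_u = I$, Gaussian noise, and an ensemble of independent runs processed in parallel; the function name and all default parameter values are hypothetical choices, not taken from the text.

```python
import numpy as np

def simulate_lms_emse(M=4, mu=0.01, sigma_v2=0.01, runs=400,
                      iters=2500, tail=1000, seed=0):
    # Runs `runs` independent LMS filters in parallel and averages the
    # squared a priori error |e_a(i)|^2 over the last `tail` iterations.
    rng = np.random.default_rng(seed)
    w_o = rng.standard_normal(M)            # unknown model w^o (illustrative)
    w = np.zeros((runs, M))                 # w_{-1} = 0 for every run
    acc = 0.0
    for i in range(iters):
        u = rng.standard_normal((runs, M))  # one regressor per run, R_u = I
        v = np.sqrt(sigma_v2) * rng.standard_normal(runs)
        y = np.sum(u * w, axis=1)           # u_i w_{i-1}
        e_a = u @ w_o - y                   # a priori error u_i (w^o - w_{i-1})
        e = e_a + v                         # output error d(i) - u_i w_{i-1}
        w = w + mu * u * e[:, None]         # LMS update
        if i >= iters - tail:
            acc += np.mean(e_a**2)
    return acc / tail
```

With the defaults above, the simulated value can be set against $\mu\sigma_v^2\,\mathrm{Tr}(R_u)/2 = 2\times 10^{-4}$. Some discrepancy should be expected, since the expression is a first-order approximation in $\mu$ and the simulation is a finite-sample average.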


Learning Curve 

Finally, since $E\,|e_a(i)|^2 = E\,\|\tilde{w}_{i-1}\|^2_{R_u}$, we find that the evolution of $E\,|e_a(i)|^2$ is
described by the top entry of the state vector $\mathcal{W}_i$ in (24.17)-(24.19) with $\sigma$ chosen as $\sigma = r$.
In Probs. V.9 and V.10, we derive a recursion for $E\,|e_a(i)|^2$. Specifically, we show that


$$E\,|e_a(i)|^2 = E\,|e_a(i-1)|^2 + w^{o*}(A_{i-1} - A_{i-2})\,w^o + \mu^2\sigma_v^2\,\mathrm{Tr}(R_uA_{i-2}), \quad i \geq 1$$


with initial condition $E\,|e_a(0)|^2 = w^{o*}R_uw^o$ (assuming $w_{-1} = 0$ so that $\tilde{w}_{-1} = w^o$),
and where the matrix $A_i$ is computed via $A_i = \mathrm{vec}^{-1}(a_i)$, where $a_i = Fa_{i-1}$ with initial
conditions $a_{-1} = r$ and $A_{-1} = R_u$. The learning curve of the filter is then given by
$E\,|e(i)|^2 = \sigma_v^2 + E\,|e_a(i)|^2$.
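The learning-curve recursion can be sketched in a few lines once $F$ is available. As an illustrative assumption, the regressors below again have independent, equally likely $\pm 1$ entries so that $F$ can be formed exactly by enumeration; the function name `learning_curve` is hypothetical.

```python
import itertools
import numpy as np

def learning_curve(w_o, mu, sigma_v2, n_iter):
    # Returns E|e(i)|^2 = sigma_v^2 + E|e_a(i)|^2 for i = 0, ..., n_iter-1,
    # assuming w_{-1} = 0 and binary +/-1 regressor entries (illustration only).
    M = w_o.size
    patterns = [np.array(p) for p in itertools.product([-1.0, 1.0], repeat=M)]
    outers = [np.outer(u, u) for u in patterns]
    R = sum(outers) / len(outers)
    I = np.eye(M)
    F = (np.eye(M * M) - mu * (np.kron(I, R) + np.kron(R.T, I))
         + mu**2 * sum(np.kron(C.T, C) for C in outers) / len(outers))
    a_prev2 = R.reshape(-1, order="F")       # a_{-1} = r
    a_prev1 = F @ a_prev2                    # a_0 = F a_{-1}
    ea2 = [w_o @ R @ w_o]                    # E|e_a(0)|^2 = w^o* R_u w^o
    for _ in range(1, n_iter):
        A1 = a_prev1.reshape(M, M, order="F")    # A_{i-1}
        A2 = a_prev2.reshape(M, M, order="F")    # A_{i-2}
        ea2.append(ea2[-1] + w_o @ (A1 - A2) @ w_o
                   + mu**2 * sigma_v2 * np.trace(R @ A2))
        a_prev2, a_prev1 = a_prev1, F @ a_prev1  # advance a_i = F a_{i-1}
    return sigma_v2 + np.array(ea2)
```

For this regressor model the curve starts at $\sigma_v^2 + \|w^o\|^2$ and should settle at $\sigma_v^2 + \mu\sigma_v^2M/(2 - \mu M)$, the steady-state value implied by the theorem.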


24.A APPENDIX: AVERAGING ANALYSIS 


The independence assumption (22.23) is a widely used condition in the transient analysis of adaptive 
filters. Although not valid in general, because of the tapped-delay-line structure of the regression 
data in most filter implementations, its value lies in the simplifications it introduces into the analysis. 
Without the independence assumption, the transient, convergence, and stability analyses of adaptive 
filters can become highly challenging, even for the simplest of algorithms. Fortunately, there are 
results in the literature that show, in one way or another, that performance results that are obtained 
using the independence assumption are reasonably close to actual filter performance when the step- 
size is sufficiently small. The purpose of this appendix is threefold: (i) to describe some of these 
results, (ii) to give an overview of averaging analysis, which is a useful method of filter analysis 
without the independence assumption (albeit one that assumes infinitesimally small step-sizes), and 
(iii) to combine energy-conservation arguments with averaging analysis in order to show that the 
results obtained from an averaging analysis are essentially identical to those obtained from an inde- 
pendence analysis. 


N-Dependent Regressors


To begin with, one of the earliest results on the accuracy of the independence analysis assumes
tapped-delay-line regressors of the form $u_i = [\,u(i)\ \ u(i-1)\ \ \ldots\ \ u(i-M+1)\,]$, with an input
sequence $\{u(i)\}$ that is generated by passing an i.i.d. binary sequence $\{s(k)\}$ through an FIR filter
of length $L$. Specifically,


$$u(i) = \sum_{k=0}^{L-1} h(k)\,s(i-k) \tag{24.26}$$


where $\{h(k)\}$ denotes the impulse response sequence of the filter. This model is adequate, for
example, for equalization applications with BPSK modulation, in which case $\{h(k)\}$ refers to the
channel impulse response. A useful property of the model (24.26) with i.i.d. $\{s(k)\}$ is that two
regressors $\{u_i, u_j\}$ are truly independent if the time difference $|i - j|$ exceeds the sum $M + L$.

Using model (24.26), it has been shown by Mazo (1979), in the case of LMS, that results ob- 
tained from the independence theory are reasonable approximations for the actual filter performance 
for small step-sizes. The main conclusion is the following (compare the expression for the EMSE in 
the statement of the theorem below with the one given by (23.58)). 


Theorem 24.2 (Binary model) Assume the input sequence $\{u(i)\}$ is generated as
in (24.26) with an i.i.d. binary sequence $\{s(k)\}$. Assume further that the reference
sequence $\{d(i)\}$ is modeled as in (22.1) with $R_u > 0$. Then it holds that the actual
EMSE of LMS is given by


$$\mathrm{EMSE} = \lim_{i\to\infty} E\,|e_a(i)|^2 = \mu\sigma_v^2\,\mathrm{Tr}(R_u)/2 + O(\mu^2)$$


There exists a generalization of the above result that allows for input sequences $\{s(k)\}$ that are
not restricted to a finite alphabet. Instead, $\{s(k)\}$ is allowed to be an arbitrary i.i.d. sequence;
compare again the result below with (23.58) and (23.23).

Theorem 24.3 (i.i.d. model) The same conclusion of Thm. 24.2 holds for arbitrary
i.i.d. sequences $\{s(k)\}$. Moreover, $E\,\tilde{w}_i$ converges to its limiting value
approximately as a linear system with modes at $\{1 - \mu\lambda_k\}$, where the $\lambda_k$ are the
eigenvalues of $R_u$.


Theorem 24.3 is due to Jones, Cavin, and Reed (1982), and it can be established under the weaker
condition of a noise sequence $\{v(i)\}$ that is not required to be i.i.d. or even independent of the
regressors $\{u_i\}$. However, for our purposes here, it is sufficient to state the result, as above, for
noise sequences $\{v(i)\}$ that satisfy model (22.1).

The arguments that establish Thms. 24.2-24.3 assume that the step-size is sufficiently small to
guarantee filter convergence, and subsequently they establish that the resulting steady-state
performance can be predicted reasonably well by independence theory. The next result by Macchi and
Eweda (1983) goes a step further and shows that a sufficiently small step-size does exist to
guarantee filter convergence even when the regressors are not independent. The result relaxes assumption
(24.26) and assumes that the extended sequence $\{d(i), u_i\}$ is such that


$$\{\ldots, (d(i-1), u_{i-1}), (d(i), u_i)\} \quad \text{and} \quad \{(d(i+j), u_{i+j}), (d(i+j+1), u_{i+j+1}), \ldots\}$$


are mutually independent whenever $j > N$ for some integer $N > M/12$. Sequences satisfying this
property are said to be N-dependent.


Theorem 24.4 (N-dependent regressors) Assume the sequence $\{(d(i), u_i)\}$ is
N-dependent for some integer $N > M/12$, and assume further that the moments
of $(d(i), u_i)$ are bounded in the following manner:


E (Iis) «oo foralln «12 and E COME f) < 00 


with $R_u > 0$. Then there exists a pair $(\mu_o, \beta)$ of positive real numbers such that
the weight-error vector $\tilde{w}_i$ of LMS satisfies


$$\limsup_{i\to\infty}\ E\,\|\tilde{w}_i\|^2 \leq \beta\mu \quad \text{for all step-sizes } \mu < \mu_o$$


Averaging Analysis 

Theorems 24.2-24.4, and the associated arguments and derivations, are specific to LMS. Similar
results for a wider class of adaptive filters can be obtained by appealing to a method of analysis
known as averaging analysis. In this framework, the regressor sequence $\{u_i\}$ is not required to be
N-dependent anymore; instead it is required to satisfy a certain mixing condition, namely, that the
correlation between $u_j$ and $u_i$ "dies out" as the time difference $|i - j|$ increases. The regressor
sequence $\{u_i\}$ is also required to be bounded, i.e., there should exist $\beta < \infty$ such that


$$\sup_{i \geq 0}\ \|u_i\|^2 \leq \beta < 2/\mu \quad \text{with probability 1}$$


Note that this boundedness condition is not satisfied for some important input distributions, e.g.,
LMS with Gaussian distributed regressors. Averaging theory further requires the step-size to be
vanishingly small. Nevertheless, and unlike the previous results, the theory applies to a larger class
of adaptive algorithms, with little modification in the basic theorems.

To describe the mixing condition that is imposed on the regressors $\{u_i\}$ we need the following
definition (Durrett (1996)).


Definition 24.1 (Uniform-mixing processes) A random process $\xi_i$ is said to be a
uniform- (or $\phi$-) mixing process if there exists a sequence $\phi(n)$ satisfying
$\lim_{n\to\infty} \phi(n) = 0$ such that


$$\sup_{a,b}\ \left| \mathrm{Prob}(\xi_{i+n} = a \mid \xi_i = b) - \mathrm{Prob}(\xi_{i+n} = a) \right| \leq \phi(n) \quad \text{for all } i, n$$


That is, the variables $\xi_{i+n}$ and $\xi_i$ become essentially independent as the time
difference $|n|$ grows; the notation $\mathrm{Prob}(\xi = a)$ denotes the probability of the event
$\xi = a$.


Examples of $\phi$-mixing processes are N-dependent processes, processes generated from bounded
white-noise filtered through a stable finite-dimensional linear filter, as well as purely nondeterministic
processes. Recall that a process $\{\xi_i\}$ is said to be purely nondeterministic if, for any $i$, $\xi_i$ cannot be
perfectly predicted from the previous elements $\{\xi_{i-1}, \xi_{i-2}, \ldots, \xi_{i-k}\}$ for any $k$. In other words,
if $\hat{\xi}_i$ is an estimator for $\xi_i$ given the past values $\{\xi_{i-1}, \xi_{i-2}, \ldots, \xi_{i-k}\}$, then there is a constant
$\sigma^2 > 0$ such that $\mathrm{var}(\hat{\xi}_i - \xi_i) \geq \sigma^2 > 0$ even when $i \to \infty$. Here, $\mathrm{var}(\cdot)$ denotes the variance of
its argument.


Now consider a general adaptive filter weight-error update of the form


$$\tilde{w}_i = \tilde{w}_{i-1} + \mu f(i, \tilde{w}_{i-1}) \quad \text{with initial condition } \tilde{w}_{-1} \tag{24.27}$$


The function $f$ is stochastic, i.e., for every $i$ and $\tilde{w}_{i-1}$, $f(i, \tilde{w}_{i-1})$ is a random vector. For example,
in the LMS case we have

$$\tilde{w}_i = \tilde{w}_{i-1} - \mu u_i^*\,[u_i\tilde{w}_{i-1} + v(i)]$$

so that

$$f(i, \tilde{w}_{i-1}) = -u_i^*u_i\tilde{w}_{i-1} - u_i^*v(i)$$


Define the averaged function $f_{\rm av}$,


$$f_{\rm av}(i, \tilde{w}_{i-1}) \triangleq E\,f(i, \tilde{w}_{i-1}) \tag{24.28}$$


where $\tilde{w}_{i-1}$ is considered constant for the computation of the expected value. Again, for LMS, the
averaged function is given by


$$f_{\rm av}^{\rm LMS}(i, \tilde{w}_{i-1}) = E\left(-u_i^*u_i\tilde{w}_{i-1} - u_i^*v(i)\right) = -R_u\tilde{w}_{i-1}$$


where we are assuming that the data $\{d(i), u_i\}$ satisfy model (22.1). Define further the averaged
system:

$$\tilde{w}_i^{\rm av} = \tilde{w}_{i-1}^{\rm av} + \mu f_{\rm av}(i, \tilde{w}_{i-1}^{\rm av}), \qquad \tilde{w}_{-1}^{\rm av} = \tilde{w}_{-1} \tag{24.29}$$


where the stochastic function $f(\cdot,\cdot)$ in (24.27) is replaced by its averaged value and, accordingly,
the corresponding weight-error vectors are denoted by $\tilde{w}_i^{\rm av}$. Again, for LMS, for example, we have


$$\tilde{w}_i^{\rm av,LMS} = (I - \mu R_u)\,\tilde{w}_{i-1}^{\rm av,LMS}$$


It is clear from this simple example that an averaged system is not useful to estimate the steady-state
performance of an adaptive filter (for instance, the noise information is lost). To do so, it is helpful
to consider instead a partially averaged system, defined as


$$\tilde{w}_i^{\rm pav} = \left[I + \mu\nabla_{\tilde{w}} f_{\rm av}(0)\right]\tilde{w}_{i-1}^{\rm pav} + \mu\left[f(i, 0) - f_{\rm av}(i, 0)\right] \tag{24.30}$$


where $\nabla_{\tilde{w}} f_{\rm av}(0)$ denotes the value of the gradient of $f_{\rm av}(i, \tilde{w})$ with respect to $\tilde{w}$ evaluated at the
origin. Using the LMS algorithm again as an example, we have


$$\nabla_{\tilde{w}} f_{\rm av}(0) = -R_u \quad \text{and} \quad f(i, 0) - f_{\rm av}(i, 0) = -u_i^*v(i)$$

so that

$$\tilde{w}_i^{\rm pav,LMS} = (I - \mu R_u)\,\tilde{w}_{i-1}^{\rm pav,LMS} - \mu u_i^*v(i)$$
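The three LMS recursions just described (original, averaged, and partially averaged) can be run side by side on the same data. The sketch below assumes real white Gaussian regressors with $R_u = I$ and a unit-norm initial weight error; these choices, the function name, and the default values are illustrative assumptions. Note also that Gaussian regressors do not satisfy the boundedness condition stated earlier, so the example is meant only as a numerical illustration.

```python
import numpy as np

def compare_lms_with_averaged(M=4, mu=0.001, sigma_v2=1e-4, iters=8000, seed=1):
    # Drives the weight-error recursions of LMS with one common realization
    # of {u_i, v(i)} and records how far the original system strays from the
    # partially averaged one.
    rng = np.random.default_rng(seed)
    R = np.eye(M)                           # R_u for white unit-variance data
    w_tilde = np.ones(M) / np.sqrt(M)       # common initial condition
    w_av = w_tilde.copy()
    w_pav = w_tilde.copy()
    max_gap = 0.0
    for _ in range(iters):
        u = rng.standard_normal(M)
        v = np.sqrt(sigma_v2) * rng.standard_normal()
        w_tilde = w_tilde - mu * u * (u @ w_tilde + v)   # (24.27) for LMS
        w_av = w_av - mu * (R @ w_av)                    # (24.29): noise is lost
        w_pav = w_pav - mu * (R @ w_pav) - mu * u * v    # (24.30): noise retained
        max_gap = max(max_gap, np.linalg.norm(w_tilde - w_pav))
    return max_gap, np.linalg.norm(w_av), np.linalg.norm(w_pav)
```

The averaged system decays deterministically toward zero, whereas the partially averaged system keeps fluctuating because it retains the noise term; the gap between the original and partially averaged trajectories is expected to shrink as the step-size is reduced.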
The following result from Solo and Kong (1995, Chapter 9) now states that if the step-size $\mu$ is
sufficiently small, and for fairly general adaptive schemes, the original weight-error vector $\tilde{w}_i$ of (24.27)
remains close to the partially averaged weight-error vector $\tilde{w}_i^{\rm pav}$ of (24.30), and that the steady-state
covariance matrix of $\tilde{w}_i$ is close to that of $\tilde{w}_i^{\rm pav}$ as well.


Theorem 24.5 (Averaging result) Consider a weight-error recursion of the form
(24.27) and its averaged forms (24.29) and (24.30), with regressor sequence $\{u_i\}$
assumed to be uniform mixing. Assume further that the following conditions hold:


1. The origin, 0, is an exponentially-stable equilibrium point of the averaged
system (24.29) with decay rate $O(\mu)$.


2. The gradient vector $\nabla_{\tilde{w}} f_{\rm av}(i, \tilde{w})$ exists and is continuous at the origin.


3. There exists a positive constant $c$ such that, for any vectors $a$ and $b$, the
following so-called Lipschitz condition holds: $\|\nabla_{\tilde{w}} f(i, a) - \nabla_{\tilde{w}} f(i, b)\| \leq c\,\|a - b\|$.


Under these conditions, $\tilde{w}_i$ and $\tilde{w}_i^{\rm pav}$ obtained from (24.27) and (24.30) satisfy


, a > TE S 5 2 
lim sup E ||; - @7°"|| = 0, lim lim EJ; — &?” ||" =0 
A0 i»0 p70 i—oo 

NET T RS TET l E~ pav s pav 
— i H m — d Ü 
lim lim Ed; lim lim EG wi 
L—0i—cc \ u U0 i~ H 


In view of the above statement, and for sufficiently small step-sizes, we may evaluate the
performance of an adaptive filter by studying the performance of its partially-averaged system (24.30). For
example, for nondeterministic regressors $\{u_i\}$, and assuming the noise sequence $\{v(i)\}$ satisfies
model (22.1), the steady-state EMSE performance of LMS can again be deduced to be


$$\mathrm{EMSE} = \lim_{i\to\infty} E\,|e_a(i)|^2 = \mu\sigma_v^2\,\mathrm{Tr}(R_u)/2 + O(\mu^2)$$


as we verify next in greater detail.


Energy Conservation and Averaging Analysis 


We can combine energy-conservation arguments with averaging analysis in order to evaluate the
performance of adaptive filters. For brevity, we shall focus only on LMS and NLMS for which the
functions$^7$ $f(i, \tilde{w}_{i-1})$, $\nabla_{\tilde{w}} f_{\rm av}(0)$, and $(f(i, 0) - f_{\rm av}(i, 0))$ are listed in Table 24.1, where we are
also introducing the stochastic function $f_\Delta(i, 0)$ defined via the identity


$$f(i, 0) - f_{\rm av}(i, 0) = u_i^*\, f_\Delta(i, 0)$$

With this definition, the partially-averaged system (24.30) becomes

$$\tilde{w}_i^{\rm pav} = \left[I + \mu\nabla_{\tilde{w}} f_{\rm av}(0)\right]\tilde{w}_{i-1}^{\rm pav} + \mu u_i^*\, f_\Delta(i, 0) \tag{24.31}$$


We are now in a position to study the steady-state and transient behavior of such systems.


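Steady-state analysis. Assume the data $\{d(i), u_i\}$ satisfy model (22.1) and let $e(i) = d(i) - u_iw_{i-1}$
denote the output error at time $i$. In steady-state analysis, we are interested in evaluating
the steady-state variance of $e(i)$, which translates into the filter mean-square error, i.e.,
$\mathrm{MSE} = E\,|e(\infty)|^2$. To evaluate the MSE, we shall rely on the same arguments we used in this part and
also in Part IV (Mean-Square Performance). Let $\Sigma$ denote any $M \times M$ Hermitian positive-definite
matrix, and introduce the weighted a priori and a posteriori errors


$$e_a^{\Sigma}(i) \triangleq u_i\Sigma\left[I + \mu\nabla_{\tilde{w}} f_{\rm av}(0)\right]\tilde{w}_{i-1}^{\rm pav}, \qquad e_p^{\Sigma}(i) \triangleq u_i\Sigma\,\tilde{w}_i^{\rm pav} \tag{24.32}$$


7 For a discussion on a larger class of adaptive filters, including LMF, LMMN, and sign-error LMS, see Shin,
Sayed, and Song (2004b).


TABLE 24.1 List of functions useful in the averaging analysis of LMS and NLMS.

Filter | $f(i, \tilde{w}_{i-1})$ | $\nabla_{\tilde{w}} f_{\rm av}(0)$ | $f(i, 0) - f_{\rm av}(i, 0)$ | $f_\Delta(i, 0)$
LMS | $-u_i^*u_i\tilde{w}_{i-1} - u_i^*v(i)$ | $-R_u$ | $-u_i^*v(i)$ | $-v(i)$
NLMS | $-\dfrac{u_i^*u_i}{\|u_i\|^2}\tilde{w}_{i-1} - \dfrac{u_i^*v(i)}{\|u_i\|^2}$ | $-E\left(\dfrac{u_i^*u_i}{\|u_i\|^2}\right) \approx -\dfrac{R_u}{\mathrm{Tr}(R_u)}$ | $-\dfrac{u_i^*v(i)}{\|u_i\|^2}$ | $-\dfrac{v(i)}{\|u_i\|^2}$
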
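If we multiply both sides of (24.31) by $u_i\Sigma$ from the left we find that

$$e_p^{\Sigma}(i) = e_a^{\Sigma}(i) + \mu\, u_i\Sigma u_i^*\, f_\Delta(i, 0) \tag{24.33}$$


Solving for $f_\Delta(i, 0)$ and substituting into (24.31), we arrive at


$$\tilde{w}_i^{\rm pav} = \left[I + \mu\nabla_{\tilde{w}} f_{\rm av}(0)\right]\tilde{w}_{i-1}^{\rm pav} + \frac{u_i^*}{\|u_i\|^2_{\Sigma}}\left[e_p^{\Sigma}(i) - e_a^{\Sigma}(i)\right]$$


By equating the weighted norms on both sides of this equation, we conclude that the following energy
equality should hold:


$$\|\tilde{w}_i^{\rm pav}\|^2_{\Sigma} = \|\tilde{w}_{i-1}^{\rm pav}\|^2_{\Sigma'} + \bar{\mu}^{\Sigma}(i)\,|e_p^{\Sigma}(i)|^2 - \bar{\mu}^{\Sigma}(i)\,|e_a^{\Sigma}(i)|^2 \tag{24.34}$$


where


$$\Sigma' \triangleq \left[I + \mu\nabla_{\tilde{w}} f_{\rm av}(0)\right]^*\,\Sigma\,\left[I + \mu\nabla_{\tilde{w}} f_{\rm av}(0)\right]$$


and $\bar{\mu}^{\Sigma}(i) = (\|u_i\|^2_{\Sigma})^{\dagger}$. This result is the extension of the energy relation of Thm. 22.1 to the
partially-averaged system (24.31).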

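Replacing $e_p^{\Sigma}(i)$ in (24.34) by its equivalent expression (24.33), taking expectations of both sides,
and using $E\,f_\Delta(i, 0) = 0$, we get


$$E\,\|\tilde{w}_i^{\rm pav}\|^2_{\Sigma} = E\,\|\tilde{w}_{i-1}^{\rm pav}\|^2_{\Sigma'} + \mu^2\, E\left[\,|f_\Delta(i, 0)|^2\,\|u_i\|^2_{\Sigma}\,\right] \tag{24.35}$$

For LMS, we have

$$E\left[\,|f_\Delta(i, 0)|^2\,\|u_i\|^2_{\Sigma}\,\right] = E\,|f_\Delta(i, 0)|^2 \cdot E\,\|u_i\|^2_{\Sigma} = \sigma_v^2\, E\,\|u_i\|^2_{\Sigma}$$

while for NLMS, we shall use the approximation

$$E\left[\,|f_\Delta(i, 0)|^2\,\|u_i\|^2_{\Sigma}\,\right] = \sigma_v^2\, E\left(\frac{\|u_i\|^2_{\Sigma}}{\|u_i\|^4}\right) \approx \frac{\sigma_v^2}{[\mathrm{Tr}(R_u)]^2}\, E\,\|u_i\|^2_{\Sigma}$$

In this way, the variance relation (24.35) will be replaced by


$$E\,\|\tilde{w}_i^{\rm pav}\|^2_{\Sigma} = E\,\|\tilde{w}_{i-1}^{\rm pav}\|^2_{\Sigma'} + \mu^2\gamma\, E\,\|u_i\|^2_{\Sigma} \tag{24.36}$$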




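where the positive scalar $\gamma$ is defined by


$$\gamma = \sigma_v^2 \ \ (\text{for LMS}), \qquad \gamma = \frac{\sigma_v^2}{[\mathrm{Tr}(R_u)]^2} \ \ (\text{for NLMS}) \tag{24.37}$$


For other adaptive filters, we would proceed with (24.35).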

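The subsequent analysis can be simplified by appealing to a change of coordinates. Thus, let
$R_u = U\Lambda U^*$ denote the eigen-decomposition of $R_u$, with $U$ unitary, and introduce the transformed
quantities:


$$\bar{w}_i^{\rm pav} \triangleq U^*\tilde{w}_i^{\rm pav}, \quad \bar{\nabla}_{\tilde{w}} f_{\rm av}(0) \triangleq U^*\nabla_{\tilde{w}} f_{\rm av}(0)\,U, \quad \bar{u}_i \triangleq u_iU, \quad \bar{\Sigma} \triangleq U^*\Sigma U$$

Then the variance relation (24.36) retains a similar form:


$$E\,\|\bar{w}_i^{\rm pav}\|^2_{\bar{\Sigma}} = E\,\|\bar{w}_{i-1}^{\rm pav}\|^2_{\bar{\Sigma}'} + \mu^2\gamma\, E\,\|\bar{u}_i\|^2_{\bar{\Sigma}} \tag{24.38}$$


with


$$\bar{\Sigma}' = \bar{\Sigma} + 2\mu\,\bar{\Sigma}\,\bar{\nabla}_{\tilde{w}} f_{\rm av}(0) + \mu^2\left[\bar{\nabla}_{\tilde{w}} f_{\rm av}(0)\right]^*\bar{\Sigma}\,\bar{\nabla}_{\tilde{w}} f_{\rm av}(0) \tag{24.39}$$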


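Relation (24.38) can now be used to deduce an expression for the filter MSE as follows. Since,
by Thm. 24.5, $\tilde{w}_i$ and $\tilde{w}_i^{\rm pav}$ stay close to each other for small step-sizes, we assume
$u_i\tilde{w}_{i-1} \approx u_i\tilde{w}_{i-1}^{\rm pav}$, so that the filter excess-mean-square error, which is defined by


$$\mathrm{EMSE} = \lim_{i\to\infty} E\,|u_i\tilde{w}_{i-1}|^2$$

can be approximated via


$$\mathrm{EMSE} \approx \lim_{i\to\infty} E\,|u_i\tilde{w}_{i-1}^{\rm pav}|^2 = \lim_{i\to\infty} E\,\|\tilde{w}_{i-1}^{\rm pav}\|^2_{R_u} = \lim_{i\to\infty} E\,\|\bar{w}_{i-1}^{\rm pav}\|^2_{\Lambda}$$

where the second equality assumes, in view of (24.31), that $u_i$ and $\tilde{w}_{i-1}^{\rm pav}$ are essentially independent
for infinitesimally small step-sizes. Taking the limit of (24.38) as $i \to \infty$ and using the steady-state
condition $E\,\|\bar{w}_i^{\rm pav}\|^2_{\bar{\Sigma}} = E\,\|\bar{w}_{i-1}^{\rm pav}\|^2_{\bar{\Sigma}}$, we obtain


$$\lim_{i\to\infty} E\,\|\bar{w}_i^{\rm pav}\|^2_{-2\bar{\Sigma}\bar{\nabla}_{\tilde{w}} f_{\rm av}(0) - \mu[\bar{\nabla}_{\tilde{w}} f_{\rm av}(0)]^*\bar{\Sigma}\bar{\nabla}_{\tilde{w}} f_{\rm av}(0)} = \mu\gamma\,\mathrm{Tr}(\Lambda\bar{\Sigma}) \tag{24.40}$$


In order to evaluate the EMSE, we need to select $\bar{\Sigma}$ such that the weighting matrix on the left-hand
side becomes $\Lambda$:

$$-2\bar{\Sigma}\,\bar{\nabla}_{\tilde{w}} f_{\rm av}(0) - \mu\left[\bar{\nabla}_{\tilde{w}} f_{\rm av}(0)\right]^*\bar{\Sigma}\,\bar{\nabla}_{\tilde{w}} f_{\rm av}(0) = \Lambda \tag{24.41}$$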


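LMS. Using $\bar{\nabla}_{\tilde{w}} f_{\rm av}(0) = -\Lambda$ and $\gamma = \sigma_v^2$, we have that $\bar{\Sigma}$ should be selected to satisfy
$2\bar{\Sigma}\Lambda - \mu\Lambda\bar{\Sigma}\Lambda = \Lambda$, that is,

$$\bar{\Sigma} = (2I - \mu\Lambda)^{-1}$$


Then the left-hand side of (24.40) becomes the filter EMSE and (24.40) leads to


$$\zeta^{\rm LMS} = \mu\sigma_v^2\,\mathrm{Tr}\left(\Lambda(2I - \mu\Lambda)^{-1}\right) = \mu\sigma_v^2 \sum_{k=0}^{M-1} \frac{\lambda_k}{2 - \mu\lambda_k}$$


For small step-sizes, this expression reduces to $\zeta^{\rm LMS} = \mu\sigma_v^2\,\mathrm{Tr}(R_u)/2$, which is the same expression
we derived for small step-sizes in Sec. 16.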


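NLMS. Using $\bar{\nabla}_{\tilde{w}} f_{\rm av}(0) = -\Lambda/\mathrm{Tr}(R_u)$ and $\gamma = \sigma_v^2/[\mathrm{Tr}(R_u)]^2$, then $\bar{\Sigma}$ should be chosen to
satisfy

$$\frac{2\bar{\Sigma}\Lambda}{\mathrm{Tr}(R_u)} - \frac{\mu\Lambda\bar{\Sigma}\Lambda}{[\mathrm{Tr}(R_u)]^2} = \Lambda$$

that is,


$$\bar{\Sigma} = \left(2\,\mathrm{Tr}(R_u)I - \mu\Lambda\right)^{-1}[\mathrm{Tr}(R_u)]^2$$


Then the left-hand side of (24.40) becomes the filter EMSE and it leads to


$$\zeta^{\rm NLMS} = \mu\sigma_v^2\,\mathrm{Tr}\left(\Lambda\left[2\,\mathrm{Tr}(R_u)I - \mu\Lambda\right]^{-1}\right) = \mu\sigma_v^2 \sum_{k=0}^{M-1} \frac{\lambda_k}{2\,\mathrm{Tr}(R_u) - \mu\lambda_k}$$


For small step-sizes, this expression reduces to $\zeta^{\rm NLMS} = \mu\sigma_v^2/2$, which is the same expression we
derived for small step-sizes in Sec. 17.1.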
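The two EMSE expressions obtained from the averaging argument depend only on the eigenvalues of $R_u$, the step-size, and the noise variance, so they are simple to evaluate. The following sketch does so; the function names are hypothetical, and the eigenvalues are assumed to be supplied by the caller (for example from `numpy.linalg.eigvalsh` applied to an estimate of $R_u$).

```python
import numpy as np

def emse_lms_averaging(eigs, mu, sigma_v2):
    # zeta_LMS = mu * sigma_v^2 * sum_k lambda_k / (2 - mu * lambda_k)
    eigs = np.asarray(eigs, dtype=float)
    return mu * sigma_v2 * np.sum(eigs / (2.0 - mu * eigs))

def emse_nlms_averaging(eigs, mu, sigma_v2):
    # zeta_NLMS = mu * sigma_v^2 * sum_k lambda_k / (2 Tr(R_u) - mu * lambda_k)
    eigs = np.asarray(eigs, dtype=float)
    trace = eigs.sum()
    return mu * sigma_v2 * np.sum(eigs / (2.0 * trace - mu * eigs))
```

As the step-size shrinks, the first function approaches $\mu\sigma_v^2\,\mathrm{Tr}(R_u)/2$ and the second approaches $\mu\sigma_v^2/2$, in line with the small step-size remarks above.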


Transient analysis. The same variance relation (24.38) can be used to characterize the transient
behavior of the filters, in addition to the following recursion for the mean weight-error vector (which
follows by taking expectations of both sides of (24.31)):


$$E\,\bar{w}_i^{\rm pav} = \left[I + \mu\bar{\nabla}_{\tilde{w}} f_{\rm av}(0)\right] E\,\bar{w}_{i-1}^{\rm pav} \tag{24.42}$$


Observe that $\bar{\Sigma}'$ in (24.39) will be diagonal if $\bar{\Sigma}$ is. In this way, rather than propagate the diagonal
weighting matrices, it is more convenient to rewrite the recursion for $\bar{\Sigma}'$ in terms of its diagonal
entries as

$$\bar{\sigma}' = \bar{F}\bar{\sigma} \tag{24.43}$$


where $\bar{\sigma} = \mathrm{diag}\{\bar{\Sigma}\}$, $\bar{\sigma}' = \mathrm{diag}\{\bar{\Sigma}'\}$, and $\lambda = \mathrm{diag}\{\Lambda\}$, while the $M \times M$ matrix $\bar{F}$ is defined by


$$\bar{F} \triangleq \left[I + \mu\bar{\nabla}_{\tilde{w}} f_{\rm av}(0)\right]^2 = \begin{cases} (I - \mu\Lambda)^2 & \text{for LMS} \\ \approx \left(I - \mu\Lambda/\mathrm{Tr}(\Lambda)\right)^2 & \text{for NLMS} \end{cases} \tag{24.44}$$

In this way, the recursion for $E\,\|\bar{w}_i^{\rm pav}\|^2_{\bar{\Sigma}}$ from (24.38) can be rewritten as

$$E\,\|\bar{w}_i^{\rm pav}\|^2_{\bar{\sigma}} = E\,\|\bar{w}_{i-1}^{\rm pav}\|^2_{\bar{F}\bar{\sigma}} + \mu^2\gamma\,(\lambda^{\mathsf T}\bar{\sigma}) \tag{24.45}$$


This result extends recursion (25.14) to the partially-averaged system (24.30), except that now the
matrix $\bar{F}$ is $M \times M$. We are therefore led to the statement of Thm. 24.6 below, which is justified by
the same arguments used in Sec. 25.2 further ahead.

Note that the learning curve of either filter, for infinitesimally small step-sizes, can be estimated
by evaluating $E\,|u_i\tilde{w}_{i-1}|^2 \approx E\,\|\tilde{w}_{i-1}^{\rm pav}\|^2_{R_u} = E\,\|\bar{w}_{i-1}^{\rm pav}\|^2_{\lambda}$, which can be obtained from the top entry
of the state vector $\mathcal{W}_i$ for the choice $\bar{\sigma} = \lambda$. It can be verified by experimentation that the learning
curve generated in this manner, by means of averaging analysis, agrees well with the learning curve
generated from the independence analysis in the body of the chapter, and that both theoretical curves
match well with experimental results (see Fig. 24.1).


Mean-square deviation. We can also use the variance relation (24.45) to shed further light on the
mean-square performance of the filters. In particular, for each filter, we can re-examine its EMSE as
well as its mean-square deviation (MSD), which is defined as


$$\mathrm{MSD} = \lim_{i\to\infty} E\,\|\tilde{w}_i\|^2$$


Indeed, assuming the step-size $\mu$ is chosen to guarantee filter stability, recursion (24.45) becomes in
steady-state

$$E\,\|\bar{w}_{\infty}^{\rm pav}\|^2_{(I - \bar{F})\bar{\sigma}} = \mu^2\gamma\,(\lambda^{\mathsf T}\bar{\sigma}) \tag{24.46}$$


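If we now select $\bar{\sigma}$ as the solution to the linear system of equations $(I - \bar{F})\bar{\sigma} = \mathrm{vec}\{I\}$, then the
weighting quantity that appears in (24.46) reduces to the vector of unit entries. In this way, the
left-hand side of (24.46) becomes the filter MSD and (24.46) leads to


$$\mathrm{MSD} = \mu^2\gamma\,\lambda^{\mathsf T}(I - \bar{F})^{-1}\,\mathrm{vec}\{I\} \tag{24.47}$$

Theorem 24.6 (Transient behavior by averaging analysis) Assume $\{d(i), u_i\}$
satisfy model (22.1) and consider the LMS and NLMS algorithms. Assume further
that the conditions of Thm. 24.5 hold. Then the transient behavior of either filter is
described by the $M$-dimensional model $\mathcal{W}_i = \mathcal{F}\mathcal{W}_{i-1} + \mu^2\gamma\,\mathcal{Y}$, where $\gamma$ is defined
by (24.37) and


$$\mathcal{F} = \begin{bmatrix} 0 & 1 & & & \\ 0 & 0 & 1 & & \\ 0 & 0 & 0 & 1 & \\ \vdots & & & & \\ 0 & 0 & 0 & & 1 \\ -p_0 & -p_1 & -p_2 & \cdots & -p_{M-1} \end{bmatrix} \quad (M \times M)$$

with $p(x) \triangleq \det(xI - \bar{F}) = x^M + \sum_{k=0}^{M-1} p_kx^k$ denoting the characteristic polynomial
of $\bar{F}$ in (24.44). Also,


$$\mathcal{W}_i = \begin{bmatrix} E\,\|\bar{w}_i^{\rm pav}\|^2_{\bar{\sigma}} \\ E\,\|\bar{w}_i^{\rm pav}\|^2_{\bar{F}\bar{\sigma}} \\ \vdots \\ E\,\|\bar{w}_i^{\rm pav}\|^2_{\bar{F}^{M-1}\bar{\sigma}} \end{bmatrix}, \qquad \mathcal{Y} = \begin{bmatrix} \lambda^{\mathsf T}\bar{\sigma} \\ \lambda^{\mathsf T}\bar{F}\bar{\sigma} \\ \vdots \\ \lambda^{\mathsf T}\bar{F}^{M-1}\bar{\sigma} \end{bmatrix}$$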
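Because $\bar{F}$ in (24.44) is diagonal, the transient behavior predicted by the averaging analysis can also be obtained by propagating the individual mode variances $x_i(k) = E\,|\bar{w}_i^{\rm pav}(k)|^2$, which satisfy $x_i = \bar{F}x_{i-1} + \mu^2\gamma\lambda$ as a consequence of (24.45). This is an equivalent alternative to the companion-form model in the theorem, swapped in here for simplicity, and not the construction used in the text. The function name and arguments below are illustrative assumptions.

```python
import numpy as np

def averaging_transient(eigs, w_bar_init, mu, gamma, n_iter, nlms=False):
    # eigs: eigenvalues of R_u; w_bar_init: transformed initial weight error.
    # gamma: sigma_v^2 for LMS, sigma_v^2 / Tr(R_u)^2 for NLMS, as in (24.37).
    lam = np.asarray(eigs, dtype=float)
    step = mu / lam.sum() if nlms else mu      # NLMS uses -Lambda/Tr(Lambda)
    f_diag = (1.0 - step * lam) ** 2           # diagonal entries of F-bar
    x = np.abs(np.asarray(w_bar_init, dtype=float)) ** 2
    emse_curve, msd_curve = [], []
    for _ in range(n_iter):
        emse_curve.append(lam @ x)             # weighting sigma-bar = lambda
        msd_curve.append(x.sum())              # weighting of unit entries
        x = f_diag * x + mu**2 * gamma * lam   # (24.45), entry by entry
    return np.array(emse_curve), np.array(msd_curve)
```

For LMS, the steady-state values of the two curves should agree with $\mu\gamma\sum_k \lambda_k/(2 - \mu\lambda_k)$ and with (24.47), respectively.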


Learning curves obtained by simulation, by theory using the independence assumption, 
and by theory using averaging analysis. 


[Figure 24.1 plot: MSE (dB) versus iteration, 0 to 2000. Curves: simulation; theory (using averaging); theory (using independence).]


FIGURE 24.1 Theoretical and simulated learning curves for a 10-tap LMS implementation
using $\mu = 0.005$ and Gaussian input with SNR set at 30 dB. The theoretical curves are
evaluated with and without the independence assumption (22.23): in one case we use the
result of Thm. 24.6, which relies on averaging, and in the other case we rely on the results of
Chapter 24 under independence.


CHAPTER 25


Data-Normalized Filters


We extend the transient analysis of the earlier chapters to more general filter recursions,
starting with the normalized LMS algorithm.


25.1 NLMS FILTER 


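We thus consider the $\epsilon$-NLMS recursion


$$w_i = w_{i-1} + \mu\,\frac{u_i^*}{\epsilon + \|u_i\|^2}\,e(i) \tag{25.1}$$


for which the data normalization in (22.2) is given by

$$g[u_i] = \epsilon + \|u_i\|^2 \tag{25.2}$$


In this case, relations (22.26)-(22.27) and (22.29) become


$$\begin{aligned} E\,\|\tilde{w}_i\|^2_{\Sigma} &= E\,\|\tilde{w}_{i-1}\|^2_{\Sigma'} + \mu^2\sigma_v^2\, E\left(\frac{\|u_i\|^2_{\Sigma}}{(\epsilon + \|u_i\|^2)^2}\right) \\ \Sigma' &= \Sigma - \mu\,\Sigma\, E\left(\frac{u_i^*u_i}{\epsilon + \|u_i\|^2}\right) - \mu\, E\left(\frac{u_i^*u_i}{\epsilon + \|u_i\|^2}\right)\Sigma + \mu^2\, E\left(\frac{\|u_i\|^2_{\Sigma}\,u_i^*u_i}{(\epsilon + \|u_i\|^2)^2}\right) \\ E\,\tilde{w}_i &= \left[I - \mu\, E\left(\frac{u_i^*u_i}{\epsilon + \|u_i\|^2}\right)\right] E\,\tilde{w}_{i-1} \end{aligned}$$


and we see that we need to evaluate the moments


$$E\left(\frac{\|u_i\|^2_{\Sigma}}{(\epsilon + \|u_i\|^2)^2}\right), \quad E\left(\frac{u_i^*u_i}{\epsilon + \|u_i\|^2}\right), \quad E\left(\frac{\|u_i\|^2_{\Sigma}\,u_i^*u_i}{(\epsilon + \|u_i\|^2)^2}\right) \tag{25.3}$$


Unfortunately, closed form expressions for these moments are not available in general,
even for Gaussian regressors. Still, we will be able to show that the filter is convergent in
the mean and is also mean-square stable for step-sizes satisfying $\mu < 2$, and regardless
of the input distribution (Gaussian or otherwise); see App. 25.B. We therefore treat the
general case directly. Since the arguments are similar to those in Chapter 24 for LMS, we
shall be brief.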

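Thus, introduce the $M^2 \times 1$ vectors

$$\sigma \triangleq \mathrm{vec}(\Sigma), \qquad r \triangleq \mathrm{vec}(R_u)$$

as well as the $M^2 \times M^2$ matrices

$$A \triangleq \left(E\left(\frac{u_i^*u_i}{\epsilon + \|u_i\|^2}\right)^{\mathsf T} \otimes I\right) + \left(I \otimes E\left(\frac{u_i^*u_i}{\epsilon + \|u_i\|^2}\right)\right) \tag{25.4}$$

$$B \triangleq E\left(\frac{(u_i^*u_i)^{\mathsf T} \otimes (u_i^*u_i)}{(\epsilon + \|u_i\|^2)^2}\right) \tag{25.5}$$


and the $M \times M$ matrix

$$P \triangleq E\left(\frac{u_i^*u_i}{\epsilon + \|u_i\|^2}\right) \tag{25.6}$$
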

The matrix $A$ is positive-definite, while $B$ is nonnegative-definite (see Prob. V.6). Applying
the vec notation to both sides of the expression for $\Sigma'$ in (25.3) we find that it reduces
to

$$\sigma' = F\sigma \tag{25.7}$$


where $F$ is $M^2 \times M^2$ and given by

$$F \triangleq I - \mu A + \mu^2 B \tag{25.8}$$

Moreover, the recursion for $E\,\tilde{w}_i$ can be written as

$$E\,\tilde{w}_i = (I - \mu P)\, E\,\tilde{w}_{i-1} \tag{25.9}$$


The same arguments that were used in Sec. 24.1 will show that the mean-square behavior
of $\epsilon$-NLMS is characterized by the following $M^2$-dimensional state-space model:


$$\mathcal{W}_i = \mathcal{F}\mathcal{W}_{i-1} + \mu^2\sigma_v^2\,\mathcal{Y} \tag{25.10}$$


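where $\mathcal{F}$ is the companion matrix


$$\mathcal{F} = \begin{bmatrix} 0 & 1 & & & \\ 0 & 0 & 1 & & \\ 0 & 0 & 0 & 1 & \\ \vdots & & & & \\ 0 & 0 & 0 & & 1 \\ -p_0 & -p_1 & -p_2 & \cdots & -p_{M^2-1} \end{bmatrix} \quad (M^2 \times M^2)$$


with

$$p(x) \triangleq \det(xI - F) = x^{M^2} + \sum_{k=0}^{M^2-1} p_kx^k$$


denoting the characteristic polynomial of $F$ in (25.8), $\mathcal{W}_i$ is the $M^2 \times 1$ state vector

$$\mathcal{W}_i \triangleq \begin{bmatrix} E\,\|\tilde{w}_i\|^2_{\sigma} \\ E\,\|\tilde{w}_i\|^2_{F\sigma} \\ E\,\|\tilde{w}_i\|^2_{F^2\sigma} \\ \vdots \\ E\,\|\tilde{w}_i\|^2_{F^{M^2-1}\sigma} \end{bmatrix} \tag{25.11}$$


and the $k$-th entry of $\mathcal{Y}$ is given by


$$\mathcal{Y}(k) = E\left(\frac{\|u_i\|^2_{F^k\sigma}}{(\epsilon + \|u_i\|^2)^2}\right), \quad k = 0, 1, \ldots, M^2 - 1 \tag{25.12}$$

The definitions of $\{\mathcal{W}_i, \mathcal{Y}\}$ are in terms of any $\sigma$ of interest, e.g., most commonly, $\sigma = q$
or $\sigma = r$. It is shown in App. 25.B that any $\mu < 2$ is a sufficient condition for the stability
of (25.10).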


Theorem 25.1 (Stability of $\epsilon$-NLMS) Consider the $\epsilon$-NLMS recursion (25.1)
and assume the data $\{d(i), u_i\}$ satisfy model (22.1) and the independence
assumption (22.23). The regressors need not be Gaussian. Then the filter
is convergent in the mean and is also mean-square stable for any $\mu < 2$.
Moreover, the transient behavior of the filter is characterized by the state-space
recursion (25.10)-(25.12), and the mean-square deviation and the excess
mean-square error are given by


$$\mathrm{MSD} = \mu^2\sigma_v^2\, E\left(\frac{\|u_i\|^2_{(I-F)^{-1}q}}{(\epsilon + \|u_i\|^2)^2}\right), \qquad \mathrm{EMSE} = \mu^2\sigma_v^2\, E\left(\frac{\|u_i\|^2_{(I-F)^{-1}r}}{(\epsilon + \|u_i\|^2)^2}\right)$$


where $r = \mathrm{vec}(R_u)$ and $q = \mathrm{vec}(I)$.
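The matrices $\{P, A, B\}$ and $F = I - \mu A + \mu^2 B$ can be formed exactly whenever the regressors take finitely many values. The sketch below assumes, purely for illustration, real regressors whose entries are independent and uniformly distributed over a small set of levels; the function names and the chosen levels are hypothetical. It can be used to observe numerically that the spectral radius of $F$ stays below one for step-sizes in the range $0 < \mu < 2$.

```python
import itertools
import numpy as np

def nlms_matrices(M, eps, levels=(-2.0, -1.0, 1.0, 2.0)):
    # Enumerates all equally likely regressors with entries drawn from
    # `levels` (an assumption for illustration) and averages exactly.
    regs = [np.array(p) for p in itertools.product(levels, repeat=M)]
    I = np.eye(M)
    P = np.zeros((M, M))
    B = np.zeros((M * M, M * M))
    for u in regs:
        C = np.outer(u, u)                 # u_i^* u_i for a real row vector
        g = eps + u @ u                    # g[u_i] = eps + ||u_i||^2
        P += C / g
        B += np.kron(C.T, C) / g**2        # integrand of (25.5)
    P /= len(regs)
    B /= len(regs)
    A = np.kron(P.T, I) + np.kron(I, P)    # (25.4) written in terms of P
    return regs, P, A, B

def nlms_F(A, B, mu):
    # F = I - mu A + mu^2 B, as in (25.8).
    return np.eye(A.shape[0]) - mu * A + mu**2 * B
```

A useful cross-check is that $F$ coincides with the average of $G_i^{\mathsf T} \otimes G_i$, where $G_i = I - \mu u_i^*u_i/(\epsilon + \|u_i\|^2)$ is the coefficient matrix of the noise-free weight-error recursion.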


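The expressions for the MSD and EMSE in the statement of the theorem are derived in
a manner similar to (23.51) and (23.56). They can be rewritten as


$$\mathrm{MSD} = \mu^2\sigma_v^2\,\mathrm{Tr}(S\Sigma_{\rm msd}) \quad \text{and} \quad \mathrm{EMSE} = \mu^2\sigma_v^2\,\mathrm{Tr}(S\Sigma_{\rm emse})$$


where

$$S \triangleq E\left(\frac{u_i^*u_i}{(\epsilon + \|u_i\|^2)^2}\right)$$


and the weighting matrices $\{\Sigma_{\rm msd}, \Sigma_{\rm emse}\}$ correspond to the vectors $\sigma_{\rm msd} = (I - F)^{-1}q$
and $\sigma_{\rm emse} = (I - F)^{-1}r$. That is,


$$\Sigma_{\rm msd} = \mathrm{vec}^{-1}(\sigma_{\rm msd}) \quad \text{and} \quad \Sigma_{\rm emse} = \mathrm{vec}^{-1}(\sigma_{\rm emse})$$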


Learning Curve 

Observe that since $E\,|e_a(i)|^2 = E\,\|\tilde{w}_{i-1}\|^2_{R_u}$, we again find that the time evolution of
$E\,|e_a(i)|^2$ is described by the top entry of the state vector $\mathcal{W}_i$ in (25.10)-(25.12) with $\sigma$
chosen as $\sigma = r$. The learning curve of the filter will be $E\,|e(i)|^2 = \sigma_v^2 + E\,|e_a(i)|^2$.


Small Step-Size Approximations 

Several approximations for the EMSE and MSD expressions that appear in the above theorem
are derived in Probs. V.18-V.20. The ultimate conclusion from these problems is that
for small enough $\mu$ and $\epsilon$, we get


1 


2 2 

poy 1 A0; 
E = A Tr( Ry d = —— —— 
ee pee (raat) a) ee NSP = (ap) 


(25.13) 
The expression for the EMSE is the same we derived earlier in Lemma 17.1.

Gaussian Regressors

If the regressors happen to be Gaussian, then it can be shown that the $M^2$-dimensional
state-space model (25.10)-(25.12) reduces to an $M$-dimensional model; this assertion
is proved in Probs. V.16 and V.17.


25.2 DATA-NORMALIZED FILTERS 


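The arguments that were employed in the last two sections for LMS and $\epsilon$-NLMS are
general enough and can be applied to adaptive filters with generic data nonlinearities of the
form (22.2)-(22.3). To see this, consider again the variance and mean relations (22.26)-
(22.27) and (22.29), which are reproduced below:


$$\begin{aligned} E\,\|\tilde{w}_i\|^2_{\Sigma} &= E\,\|\tilde{w}_{i-1}\|^2_{\Sigma'} + \mu^2\sigma_v^2\, E\left(\frac{\|u_i\|^2_{\Sigma}}{g^2[u_i]}\right) \\ \Sigma' &= \Sigma - \mu\,\Sigma\, E\left(\frac{u_i^*u_i}{g[u_i]}\right) - \mu\, E\left(\frac{u_i^*u_i}{g[u_i]}\right)\Sigma + \mu^2\, E\left(\frac{\|u_i\|^2_{\Sigma}\,u_i^*u_i}{g^2[u_i]}\right) \\ E\,\tilde{w}_i &= \left[I - \mu\, E\left(\frac{u_i^*u_i}{g[u_i]}\right)\right] E\,\tilde{w}_{i-1} \end{aligned} \tag{25.14}$$
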
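If we now introduce the $M^2 \times M^2$ matrices


$$A \triangleq \left(E\left(\frac{u_i^*u_i}{g[u_i]}\right)^{\mathsf T} \otimes I\right) + \left(I \otimes E\left(\frac{u_i^*u_i}{g[u_i]}\right)\right) \tag{25.15}$$

$$B \triangleq E\left(\left[\frac{u_i^*u_i}{g[u_i]}\right]^{\mathsf T} \otimes \left[\frac{u_i^*u_i}{g[u_i]}\right]\right) \tag{25.16}$$

and the $M \times M$ matrix

$$P \triangleq E\left(\frac{u_i^*u_i}{g[u_i]}\right) \tag{25.17}$$


then

$$E\,\tilde{w}_i = (I - \mu P)\, E\,\tilde{w}_{i-1}$$


and the expression for $\Sigma'$ can be written in terms of the linear vector relation


$$\sigma' = F\sigma \tag{25.18}$$

where $F$ is $M^2 \times M^2$ and given by

$$F \triangleq I - \mu A + \mu^2 B \tag{25.19}$$

Let

$$H \triangleq \begin{bmatrix} A/2 & -B/2 \\ I_{M^2} & 0 \end{bmatrix} \quad (2M^2 \times 2M^2) \tag{25.20}$$
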

Then the same arguments that were used in Chapter 24 will lead to the statement of
Thm. 25.2 listed further ahead. The expressions for the MSD and EMSE in the statement
of the theorem are derived in a manner similar to (23.51) and (23.56). They can be
rewritten as


$$\mathrm{MSD} = \mu^2\sigma_v^2\,\mathrm{Tr}(S\Sigma_{\rm msd}) \quad \text{and} \quad \mathrm{EMSE} = \mu^2\sigma_v^2\,\mathrm{Tr}(S\Sigma_{\rm emse})$$

where $S \triangleq E\left(u_i^*u_i/g^2[u_i]\right)$ and the weighting matrices $\{\Sigma_{\rm msd}, \Sigma_{\rm emse}\}$ correspond to the vectors
$\sigma_{\rm msd} = (I - F)^{-1}q$ and $\sigma_{\rm emse} = (I - F)^{-1}r$. That is,


$$\Sigma_{\rm msd} = \mathrm{vec}^{-1}(\sigma_{\rm msd}) \quad \text{and} \quad \Sigma_{\rm emse} = \mathrm{vec}^{-1}(\sigma_{\rm emse})$$


Theorem 25.2 (Stability of data-normalized filters) Consider data-normalized
adaptive filters of the form (22.2)-(22.3), and assume the data $\{d(i), u_i\}$ satisfy
model (22.1) and the independence assumption (22.23). Then the filter
is convergent in the mean and is mean-square stable for step-sizes satisfying

$$
0 < \mu < \min\left\{\frac{2}{\lambda_{\max}(P)},\ \frac{1}{\lambda_{\max}(A^{-1}B)},\ \frac{1}{\max\{\lambda(H) \in \mathbb{R}^+\}}\right\}
$$

where the matrices $\{A, B, P, H\}$ are defined by (25.15)-(25.17) and (25.20)
and $B$ is assumed finite. Moreover, the transient behavior of the filter is
characterized by the $M^2$-dimensional state-space recursion $\mathcal{W}_i = \mathcal{F}\mathcal{W}_{i-1} + \mu^2\sigma_v^2\mathcal{Y}$,
where $\mathcal{F}$ is the companion matrix

$$
\mathcal{F} = \begin{bmatrix}
0 & 1 & & & \\
0 & 0 & 1 & & \\
0 & 0 & 0 & 1 & \\
\vdots & & & \ddots & \\
0 & 0 & 0 & \cdots & 1 \\
-p_0 & -p_1 & -p_2 & \cdots & -p_{M^2-1}
\end{bmatrix} \qquad (M^2 \times M^2)
$$

with

$$
p(x) \triangleq \det(xI - F) = x^{M^2} + \sum_{k=0}^{M^2-1} p_k x^k
$$

denoting the characteristic polynomial of $F$ in (25.19). Also,

$$
\mathcal{W}_i \triangleq \begin{bmatrix}
E\|\tilde{w}_i\|^2_{\sigma} \\
E\|\tilde{w}_i\|^2_{F\sigma} \\
\vdots \\
E\|\tilde{w}_i\|^2_{F^{M^2-1}\sigma}
\end{bmatrix},
\qquad
\mathcal{Y}(k) = E\left(\frac{\|u_i\|^2_{F^k\sigma}}{g^2[u_i]}\right), \quad k = 0, \ldots, M^2-1
$$

for any $\sigma$ of interest, e.g., $\sigma = q$ or $\sigma = r$. In addition, the mean-square
deviation and the excess mean-square error are given by

$$
\text{MSD} = \mu^2\sigma_v^2\, E\left(\frac{\|u_i\|^2_{(I-F)^{-1}q}}{g^2[u_i]}\right),
\qquad
\text{EMSE} = \mu^2\sigma_v^2\, E\left(\frac{\|u_i\|^2_{(I-F)^{-1}r}}{g^2[u_i]}\right)
$$

where $r = \text{vec}(R_u)$ and $q = \text{vec}(I)$.

As before, since $E|e_a(i)|^2 = E\|\tilde{w}_{i-1}\|^2_{R_u}$, we find that the time evolution of $E|e_a(i)|^2$ is
described by the top entry of the state vector $\mathcal{W}_i$ with $\sigma$ chosen as $\sigma = r$. The learning
curve of the filter will be $E|e(i)|^2 = \sigma_v^2 + E|e_a(i)|^2$.
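The learning curve described above can also be estimated by simulation. The following is a minimal sketch, assuming real-valued white Gaussian regressors, the $\epsilon$-NLMS choice $g[u_i] = \epsilon + \|u_i\|^2$, and illustrative parameter values; the function name `nlms_learning_curve` and all numerical settings are assumptions made for this example and do not come from the text.

```python
import numpy as np

def nlms_learning_curve(M=5, n_iter=600, n_runs=200, mu=0.2, eps=1e-3,
                        sigma_v2=0.01, seed=0):
    """Ensemble-averaged estimate of E|e(i)|^2 for epsilon-NLMS (real data)."""
    rng = np.random.default_rng(seed)
    w_o = rng.standard_normal(M) / np.sqrt(M)      # unknown model vector w^o
    W = np.zeros((n_runs, M))                      # one weight vector per run
    curve = np.zeros(n_iter)
    for i in range(n_iter):
        U = rng.standard_normal((n_runs, M))       # independent regressors u_i (one per row)
        v = np.sqrt(sigma_v2) * rng.standard_normal(n_runs)
        d = U @ w_o + v                            # d(i) = u_i w^o + v(i)
        e = d - np.sum(U * W, axis=1)              # e(i) = d(i) - u_i w_{i-1}
        g = eps + np.sum(U * U, axis=1)            # g[u_i] = eps + ||u_i||^2
        W += mu * (U * (e / g)[:, None])           # w_i = w_{i-1} + mu u_i^T e(i) / g[u_i]
        curve[i] = np.mean(e ** 2)                 # average over the ensemble of runs
    return curve

sigma_v2 = 0.01
curve = nlms_learning_curve(sigma_v2=sigma_v2)
steady_state_mse = curve[-100:].mean()             # tail average approximates the MSE
print(curve[0], steady_state_mse)
```

The tail of the curve should settle slightly above $\sigma_v^2$, the gap being the EMSE.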


Small Step-Size Approximations 

In Prob. V.39 it is shown that under a boundedness requirement on the matrix B of fourth 
moments, data-normalized adaptive filters can be guaranteed to be mean-square stable for 
sufficiently small step-sizes. That is, there always exists a small-enough step-size that lies 
within the stability range described in Thm. 25.2. 
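The stability range of Thm. 25.2 can be evaluated numerically once the moments are replaced by sample averages. The sketch below assumes real-valued regressors and uses hypothetical helper names (`moment_matrices`, `step_size_bound`); the choice $g[u_i] = 1$ (LMS) and the sample size are illustrative assumptions.

```python
import numpy as np

def moment_matrices(U, g):
    """Sample estimates of P, A, B, H in (25.15)-(25.17) and (25.20), real data.
    U holds one regressor u_i per row; g maps a regressor to the scalar g[u_i]."""
    N, M = U.shape
    I_M, I_M2 = np.eye(M), np.eye(M * M)
    P = np.zeros((M, M))
    B = np.zeros((M * M, M * M))
    for u in U:
        X = np.outer(u, u) / g(u)                  # u_i^T u_i / g[u_i]
        P += X / N
        B += np.kron(X.T, X) / N
    A = np.kron(P.T, I_M) + np.kron(I_M, P)
    H = np.block([[A / 2, -B / 2], [I_M2, np.zeros_like(B)]])
    return P, A, B, H

def step_size_bound(P, A, B, H):
    """Upper limit on mu according to the three conditions of Thm. 25.2."""
    bounds = [2 / np.linalg.eigvalsh(P).max(),
              1 / np.linalg.eigvals(np.linalg.solve(A, B)).real.max()]
    lam = np.linalg.eigvals(H)
    real_pos = lam.real[(np.abs(lam.imag) < 1e-10) & (lam.real > 0)]
    if real_pos.size:                              # condition dropped if H has no such eigenvalue
        bounds.append(1 / real_pos.max())
    return min(bounds)

rng = np.random.default_rng(1)
U = rng.standard_normal((2000, 3))
P, A, B, H = moment_matrices(U, lambda u: 1.0)     # g[u] = 1 corresponds to LMS
mu_max = step_size_bound(P, A, B, H)

mu = 0.9 * mu_max                                  # a step-size inside the range
F = np.eye(9) - mu * A + mu**2 * B
print(mu_max, np.abs(np.linalg.eigvalsh(F)).max())
```

For any step-size below the computed bound, the eigenvalues of $F$ and of $I - \mu P$ built from the same sample moments should lie strictly inside the unit circle.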

Now observe that the performance results of Thm. 25.2 are in terms of the moment 
matrices (A, B, P). These moments are generally hard to evaluate for arbitrary input 
distributions and data nonlinearities g[.]. However, some simplifications occur when the 
step-size is sufficiently small. This is because, in this case, we may ignore the quadratic
term in $\mu$ that appears in the expression for $\Sigma'$ in (25.14), and thereby approximate the
variance and mean relations by


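$$
\begin{aligned}
E\|\tilde{w}_i\|^2_{\Sigma} &= E\|\tilde{w}_{i-1}\|^2_{\Sigma'} + \mu^2\sigma_v^2\, E\left(\frac{\|u_i\|^2_{\Sigma}}{g^2[u_i]}\right)\\
\Sigma' &= \Sigma - \mu\Sigma P - \mu P\Sigma\\
E\tilde{w}_i &= (I - \mu P)\, E\tilde{w}_{i-1}
\end{aligned}
\tag{25.21}
$$

where $P$ is as in (25.17). Using the weighting vector notation, we can write

$$
E\|\tilde{w}_i\|^2_{\sigma} = E\|\tilde{w}_{i-1}\|^2_{F\sigma} + \mu^2\sigma_v^2\, E\left(\frac{\|u_i\|^2_{\sigma}}{g^2[u_i]}\right) \tag{25.22}
$$

$$
F = I - \mu A \tag{25.23}
$$

where now

$$
A = (P^{T} \otimes I) + (I \otimes P)
$$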


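The variance relation (25.22) would then lead to the following approximate expressions
for the filter EMSE and MSD:

$$
\text{EMSE} = \mu^2\sigma_v^2\,\text{Tr}(S\Sigma_{\text{emse}}) \quad\text{and}\quad \text{MSD} = \mu^2\sigma_v^2\,\text{Tr}(S\Sigma_{\text{msd}})
$$

where

$$
S = E\left(\frac{u_i^*u_i}{g^2[u_i]}\right)
$$

and the weighting matrices $\{\Sigma_{\text{emse}}, \Sigma_{\text{msd}}\}$ correspond to the vectors $\sigma_{\text{emse}} = A^{-1}\text{vec}(R_u)/\mu$
and $\sigma_{\text{msd}} = A^{-1}\text{vec}(I)/\mu$. That is, $\{\Sigma_{\text{emse}}, \Sigma_{\text{msd}}\}$ are the unique solutions of the Lyapunov
equations

$$
\mu P\Sigma_{\text{msd}} + \mu\Sigma_{\text{msd}}P = I \quad\text{and}\quad \mu P\Sigma_{\text{emse}} + \mu\Sigma_{\text{emse}}P = R_u
$$

It is easy to verify that $\Sigma_{\text{msd}} = \mu^{-1}P^{-1}/2$ so that the performance expressions can be
rewritten as

$$
\text{EMSE} = \mu^2\sigma_v^2\,\text{Tr}(S\Sigma_{\text{emse}}), \qquad \text{MSD} = \mu\sigma_v^2\,\text{Tr}(SP^{-1})/2
$$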
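The small step-size expressions can be checked numerically through the vec and Kronecker-product identities. The sketch below uses randomly generated symmetric positive-definite matrices as hypothetical stand-ins for $P$, $S$ and $R_u$ (in an actual filter these moments are related through the regressor distribution); it solves the two Lyapunov equations via $A = (P^T \otimes I) + (I \otimes P)$ and compares the result with the closed form $\Sigma_{\text{msd}} = \mu^{-1}P^{-1}/2$.

```python
import numpy as np

rng = np.random.default_rng(0)
M, mu, sigma_v2 = 4, 0.01, 0.1

def random_spd(M):
    """Hypothetical symmetric positive-definite stand-in for a moment matrix."""
    X = rng.standard_normal((M, M))
    return X @ X.T + M * np.eye(M)

P, S, R_u = random_spd(M), random_spd(M), random_spd(M)

vec = lambda X: X.flatten(order="F")               # column-stacking vec(.)
unvec = lambda x: x.reshape((M, M), order="F")     # inverse operation vec^{-1}(.)

# A = (P^T kron I) + (I kron P), so that vec(Sigma P + P Sigma) = A vec(Sigma).
A = np.kron(P.T, np.eye(M)) + np.kron(np.eye(M), P)
Sigma_msd = unvec(np.linalg.solve(A, vec(np.eye(M))) / mu)
Sigma_emse = unvec(np.linalg.solve(A, vec(R_u)) / mu)

msd_lyapunov = mu**2 * sigma_v2 * np.trace(S @ Sigma_msd)
msd_closed_form = mu * sigma_v2 * np.trace(S @ np.linalg.inv(P)) / 2
emse = mu**2 * sigma_v2 * np.trace(S @ Sigma_emse)
print(msd_lyapunov, msd_closed_form, emse)
```

The two MSD values should agree to numerical precision, while the EMSE requires the solution $\Sigma_{\text{emse}}$ itself because $R_u$ and $P$ need not commute.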


Remark 25.1 (Filters with error nonlinearities) There is more to say about the transient per- 
formance of adaptive filters, especially for filters with error nonlinearities in their update equations. 
This is a more challenging class of filters to study and their performance is examined in App. 9.C 


of Sayed (2003) by using the same energy-conservation arguments of this part. The derivation used 
in that appendix to study adaptive filters with error nonlinearities can also be used to provide an 
alternative simplified transient analysis for data-normalized filters. The derivation is based on a long 
filter assumption in order to justify a Gaussian condition on the distribution of the a priori error 
signal. Among other results, it is shown in App. 9.C of Sayed (2003) that the transient behavior of 
data-normalized filters can be approximated by an $M$-dimensional linear time-invariant state-space
model even for non-Gaussian regressors. Appendix 9.E of the same reference further examines the
learning abilities of adaptive filters and shows, among other interesting results, that the learning
behavior of LMS cannot be fully captured by relying solely on mean-square analysis! ◊


25.A APPENDIX: STABILITY BOUND 


Consider a matrix $F$ of the form $F = I - \mu A + \mu^2 B$ with $A > 0$, $B \geq 0$, and $\mu > 0$. Matrices of this
form arise frequently in the study of the mean-square stability of adaptive filters (see, e.g., (25.19)).
The purpose of this section is to find conditions on $\mu$ in terms of $\{A, B\}$ in order to guarantee that
all eigenvalues of $F$ are strictly inside the unit circle, i.e., so that $-1 < \lambda(F) < 1$.

To begin with, in order to guarantee $\lambda(F) < 1$, the step-size $\mu$ should be such that (cf. the
Rayleigh-Ritz characterization of eigenvalues from Sec. B.1):

$$
\max_{\|x\|=1}\ x^*(I - \mu A + \mu^2 B)x < 1
$$

or, equivalently, $A - \mu B > 0$. The argument in parts (b) and (c) of Prob. V.3 then shows that this
condition holds if, and only if,

$$
\mu < 1/\lambda_{\max}(A^{-1}B) \tag{25.24}
$$


Moreover, in order to guarantee $\lambda(F) > -1$, the step-size $\mu$ should be such that

$$
\min_{\|x\|=1}\ x^*(I - \mu A + \mu^2 B)x > -1
$$

or, equivalently, $G(\mu) \triangleq 2I - \mu A + \mu^2 B > 0$. When $\mu = 0$, the eigenvalues of $G$ are all
positive and equal to 2. As $\mu$ increases, the eigenvalues of $G$ vary continuously with $\mu$. Indeed, the
eigenvalues of $G(\mu)$ are the roots of $\det[\lambda I - G(\mu)] = 0$. This is a polynomial equation in $\lambda$ and its
coefficients are functions of $\mu$. A fundamental result in function theory and matrix analysis states that
the zeros of a polynomial depend continuously on its coefficients and, consequently, the eigenvalues
of $G(\mu)$ vary continuously with $\mu$. This means that $G(\mu)$ will first become singular before becoming
indefinite. For this reason, there is an upper bound on $\mu$, say, $\mu_{\max}$, such that $G(\mu) > 0$ for all
$\mu < \mu_{\max}$. This bound on $\mu$ is equal to the smallest value of $\mu$ that makes $G(\mu)$ singular, i.e., for
which $\det[G(\mu)] = 0$. Now note that the determinant of $G(\mu)$ is equal to the determinant of the
block matrix

$$
K(\mu) \triangleq \begin{bmatrix} 2I - \mu A & \mu B \\ -\mu I & I \end{bmatrix}
$$

since

$$
\det\begin{bmatrix} X & W \\ Y & Z \end{bmatrix} = \det(Z)\det(X - WZ^{-1}Y)
$$

whenever $Z$ is invertible. Moreover, since we can write

$$
K(\mu) = \begin{bmatrix} 2I - \mu A & \mu B \\ -\mu I & I \end{bmatrix}
= \begin{bmatrix} 2I & 0 \\ 0 & I \end{bmatrix}\left(\begin{bmatrix} I & 0 \\ 0 & I \end{bmatrix} - \mu\begin{bmatrix} A/2 & -B/2 \\ I & 0 \end{bmatrix}\right)
$$

we find that the condition $\det[K(\mu)] = 0$ is equivalent to $\det(I - \mu H) = 0$, where

$$
H \triangleq \begin{bmatrix} A/2 & -B/2 \\ I & 0 \end{bmatrix}
$$

In this way, the smallest positive $\mu$ that results in $\det[K(\mu)] = 0$ is equal to

$$
\frac{1}{\max\{\lambda(H) \in \mathbb{R}^+\}} \tag{25.25}
$$

in terms of the largest positive real eigenvalue of $H$ when it exists.
The results (25.24)-(25.25) can be grouped together to yield the condition

$$
0 < \mu < \min\left\{\frac{1}{\lambda_{\max}(A^{-1}B)},\ \frac{1}{\max\{\lambda(H) \in \mathbb{R}^+\}}\right\} \tag{25.26}
$$

If $H$ does not have any real positive eigenvalue, then the corresponding condition is removed and we
only require $\mu < 1/\lambda_{\max}(A^{-1}B)$. The result (25.26) is valid for general $A > 0$ and $B \geq 0$. The
above derivation does not exploit any particular structure in the matrices $A$ and $B$ defined by (25.16).
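As a sanity check on (25.24)-(25.26), it may help to work through the scalar case. The following is an illustrative specialization, assuming scalars $A = a > 0$ and $B = b > 0$; the symbols $a$ and $b$ are introduced here only for this example.

```latex
% Scalar case: F = 1 - \mu a + \mu^2 b.
% Condition (25.24): F < 1  <=>  a - \mu b > 0  <=>  \mu < a/b.
% Condition (25.25): G(\mu) = 2 - \mu a + \mu^2 b vanishes at
\[
  \mu_{\pm} = \frac{a \pm \sqrt{a^2 - 8b}}{2b},
\]
% which are real only when a^2 \geq 8b. The matrix H reduces to
\[
  H = \begin{bmatrix} a/2 & -b/2 \\ 1 & 0 \end{bmatrix},
  \qquad
  \det(\lambda I - H) = \lambda^2 - \tfrac{a}{2}\lambda + \tfrac{b}{2},
  \qquad
  \lambda_{\pm} = \frac{a \pm \sqrt{a^2 - 8b}}{4},
\]
% so that the reciprocal of the largest real positive eigenvalue is
\[
  \frac{1}{\lambda_{+}} = \frac{4}{a + \sqrt{a^2 - 8b}}
  = \frac{4\left(a - \sqrt{a^2 - 8b}\right)}{8b}
  = \frac{a - \sqrt{a^2 - 8b}}{2b} = \mu_{-},
\]
% i.e., the smallest step-size at which G(\mu) becomes singular.
% When a^2 < 8b, H has no real eigenvalue, G(\mu) > 0 for all \mu,
% and only the condition \mu < a/b remains.
```

For instance, with $a = 2$ and $b = 1$ we get $F = (1-\mu)^2$, the matrix $H$ has no real eigenvalues, and the bound reduces to $\mu < a/b = 2$, which agrees with $|1 - \mu| < 1$.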


25.B APPENDIX: STABILITY OF NLMS 


The purpose of this appendix is to show that for $\epsilon$-NLMS, any $\mu < 2$ is sufficient to guarantee
mean-square stability. Thus, refer again to the discussion in Sec. 25.1 and to the definitions of the
matrices $\{A, B, P, F\}$ in (25.4)-(25.8). We already know from the result in App. 25.A that stability
in the mean and mean-square senses is guaranteed for step-sizes in the range

$$
0 < \mu < \min\left\{\frac{2}{\lambda_{\max}(P)},\ \frac{1}{\lambda_{\max}(A^{-1}B)},\ \frac{1}{\max\{\lambda(H) \in \mathbb{R}^+\}}\right\}
$$

where the third condition is in terms of the largest positive real eigenvalue of the block matrix,

$$
H \triangleq \begin{bmatrix} A/2 & -B/2 \\ I_{M^2} & 0 \end{bmatrix}
$$

The first condition on $\mu$, namely, $\mu < 2/\lambda_{\max}(P)$, guarantees convergence in the mean. The second
condition on $\mu$, namely, $\mu < 1/\lambda_{\max}(A^{-1}B)$, guarantees $\lambda(F) < 1$. The last condition,
$\mu < 1/\max\{\lambda(H) \in \mathbb{R}^+\}$, enforces $\lambda(F) > -1$. The point now is that these conditions on $\mu$ are met
by any $\mu < 2$ (i.e., $F$ is stable for any $\mu < 2$). This is because there are some important relations
between the matrices $\{A, B, P\}$ in the $\epsilon$-NLMS case. To see this, observe first that the term


$$
u_i^*u_i/(\epsilon + \|u_i\|^2) \tag{25.27}
$$

which appears in the expression (25.6) for $P$ is generally a rank-one matrix (unless $u_i = 0$); it has
$M - 1$ zero eigenvalues and one possibly nonzero eigenvalue that is equal to$^8$ $\|u_i\|^2/(\epsilon + \|u_i\|^2)$.
This eigenvalue is less than unity so that

$$
\lambda_{\max}\left(\frac{u_i^*u_i}{\epsilon + \|u_i\|^2}\right) < 1 \tag{25.28}
$$
o eg) 5 EU 
Now recalling the following Rayleigh-Ritz characterization of the maximum eigenvalue of any
Hermitian matrix $R$ (from Sec. B.1):

$$
\lambda_{\max}(R) = \max_{\|x\|=1}\ x^*Rx \tag{25.29}
$$

we conclude from (25.28) that

$$
\max_{\|x\|=1}\ x^*\left(\frac{u_i^*u_i}{\epsilon + \|u_i\|^2}\right)x < 1
$$

Applying the same characterization (25.29) to the matrix $P$ in (25.6), and using the above inequality,
we find that

$$
\lambda_{\max}(P) = \max_{\|x\|=1}\ x^*Px
= \max_{\|x\|=1}\ x^*\left[E\left(\frac{u_i^*u_i}{\epsilon + \|u_i\|^2}\right)\right]x
= \max_{\|x\|=1}\ E\left(\frac{x^*u_i^*u_ix}{\epsilon + \|u_i\|^2}\right) < 1 \tag{25.30}
$$

$^8$ Every rank-one matrix of the form $xx^*$, where $x$ is a column vector of size $M$, has $M - 1$ zero eigenvalues and
one nonzero eigenvalue that is equal to $\|x\|^2$.


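In other words, the maximum eigenvalue of $P$ is bounded by one, so that the condition
$\mu < 2/\lambda_{\max}(P)$ can be met by any $\mu < 2$.

Let us now examine the condition $\mu < 1/\lambda_{\max}(A^{-1}B)$. Using again the fact that the matrix in
(25.27) has rank one, it is shown in Prob. V.8 that

$$
2\left[\left(\frac{u_i^*u_i}{\epsilon + \|u_i\|^2}\right)^{T} \otimes \left(\frac{u_i^*u_i}{\epsilon + \|u_i\|^2}\right)\right]
\leq \left[\left(\frac{u_i^*u_i}{\epsilon + \|u_i\|^2}\right)^{T} \otimes I\right] + \left[I \otimes \left(\frac{u_i^*u_i}{\epsilon + \|u_i\|^2}\right)\right] \tag{25.31}
$$

Taking expectations of both sides, and using the definitions (25.4)-(25.5) for $A$ and $B$, we conclude
that $2B - A \leq 0$ so that the condition $\mu < 1/\lambda_{\max}(A^{-1}B)$ can be met by any $\mu$ satisfying $\mu < 2$.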
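What about the third condition on $\mu$ in terms of the positive eigenvalues of the matrix $H$? It turns
out that $\mu < 2$ is also sufficient since it already guarantees mean-square convergence of the filter, as
can be seen from the following argument. Choosing $\Sigma = I$ in the variance relation (25.3) we get

$$
\begin{aligned}
E\|\tilde{w}_i\|^2 &= E\|\tilde{w}_{i-1}\|^2_{\Sigma'} + \mu^2\sigma_v^2\, E\left(\frac{\|u_i\|^2}{(\epsilon + \|u_i\|^2)^2}\right)\\
\Sigma' &= I - 2\mu P + \mu^2 S
\end{aligned}
$$

where

$$
S \triangleq E\left(\frac{\|u_i\|^2}{(\epsilon + \|u_i\|^2)^2}\, u_i^*u_i\right)
$$

Obviously, $S \leq P$ so that $\Sigma' \leq I - 2\mu P + \mu^2 P$ and, hence,

$$
\begin{aligned}
E\|\tilde{w}_i\|^2 &\leq E\|\tilde{w}_{i-1}\|^2_{I - 2\mu P + \mu^2 P} + \mu^2\sigma_v^2\, E\left(\frac{\|u_i\|^2}{(\epsilon + \|u_i\|^2)^2}\right)\\
&= E\left[\tilde{w}_{i-1}^*(I - 2\mu P + \mu^2 P)\tilde{w}_{i-1}\right] + \mu^2\sigma_v^2\, E\left(\frac{\|u_i\|^2}{(\epsilon + \|u_i\|^2)^2}\right)
\end{aligned}
$$

Now from the result of part (a) of Prob. V.6 we know that $R_u > 0$ implies $P > 0$. We also know
from (25.30) that $\lambda_{\max}(P) < 1$. Therefore, all the eigenvalues of $P$ are positive and lie inside the
open interval $(0, 1)$. Moreover, over the interval $0 < \mu < 2$, the following quadratic function of $\mu$,

$$
k(\mu) \triangleq 1 - 2\mu\lambda + \mu^2\lambda
$$

assumes values between $1$ and $1 - \lambda$ for each of the eigenvalues $\lambda$ of $P$. Therefore, it holds that

$$
I - 2\mu P + \mu^2 P \leq \left[1 - 2\mu\lambda_{\min}(P) + \mu^2\lambda_{\min}(P)\right] I
$$

from which we conclude that

$$
E\|\tilde{w}_i\|^2 \leq \alpha\, E\|\tilde{w}_{i-1}\|^2 + \mu^2\sigma_v^2\, E\left(\frac{\|u_i\|^2}{(\epsilon + \|u_i\|^2)^2}\right)
$$

where the scalar coefficient $\alpha = 1 - 2\mu\lambda_{\min}(P) + \mu^2\lambda_{\min}(P)$ is positive and strictly less than one
for $0 < \mu < 2$. It then follows that $E\|\tilde{w}_i\|^2$ remains bounded for all $i$.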
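The three facts used in this appendix, namely $\lambda_{\max}(P) < 1$, $2B - A \leq 0$, and stability of $F$ for any $\mu < 2$, can be observed numerically. The sketch below assumes real-valued regressors drawn from an arbitrary correlated, non-Gaussian distribution chosen only for illustration, and replaces expectations by sample averages.

```python
import numpy as np

rng = np.random.default_rng(2)
M, N, eps = 3, 2000, 1e-2
# Correlated, non-Gaussian regressors (one per row); an arbitrary illustrative choice.
T = np.array([[1, .5, 0], [0, 1, .5], [0, 0, 1]])
U = rng.uniform(-1, 1, size=(N, M)) @ T

I_M = np.eye(M)
P = np.zeros((M, M))
B = np.zeros((M * M, M * M))
for u in U:
    X = np.outer(u, u) / (eps + u @ u)             # u_i^T u_i / (eps + ||u_i||^2)
    P += X / N
    B += np.kron(X.T, X) / N
A = np.kron(P.T, I_M) + np.kron(I_M, P)

lam_max_P = np.linalg.eigvalsh(P).max()            # (25.30): expected to be below 1
gap = np.linalg.eigvalsh(2 * B - A).max()          # (25.31): 2B - A <= 0
radii = [np.abs(np.linalg.eigvalsh(np.eye(M * M) - mu * A + mu**2 * B)).max()
         for mu in (0.1, 0.5, 1.0, 1.5, 1.9)]      # spectral radius of F for mu < 2
print(lam_max_P, gap, radii)
```

Because each inequality holds for every individual regressor, it also holds for the sample averages, so the check does not depend on how well the averages approximate the true moments.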
Summary and Notes


The chapters in this part describe a procedure for evaluating the transient behavior of adaptive
filters by relying on energy conservation arguments.


SUMMARY OF MAIN RESULTS 


1. Consider adaptive filters that are described by data-normalized stochastic difference equations
of the form

$$
w_i = w_{i-1} + \mu\,\frac{u_i^*}{g[u_i]}\,e(i), \qquad e(i) = d(i) - u_iw_{i-1}
$$

where $g[\cdot]$ is some positive-valued function of $u_i$. The mean-square error (MSE) of any such
filter is defined as the limiting value of $E|e(i)|^2$,

$$
\text{MSE} = \lim_{i\to\infty} E|e(i)|^2
$$

Likewise, the mean-square deviation (MSD) of the filter is defined as the limiting value of
$E\|\tilde{w}_i\|^2$,

$$
\text{MSD} = \lim_{i\to\infty} E\|\tilde{w}_i\|^2
$$

where $\tilde{w}_i = w^o - w_i$ and $w^o$ is the unknown vector in the data model (22.1).


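2. For any such adaptive filter, and for any data $\{d(i), u_i\}$ and Hermitian nonnegative-definite
weighting matrix $\Sigma$, the following weighted energy-conservation relation holds for all $i$:

$$
\|u_i\|^2_{\Sigma}\cdot\|\tilde{w}_i\|^2_{\Sigma} + |e_a^{\Sigma}(i)|^2 = \|u_i\|^2_{\Sigma}\cdot\|\tilde{w}_{i-1}\|^2_{\Sigma} + |e_p^{\Sigma}(i)|^2
$$

where $e_p^{\Sigma}(i) = u_i\Sigma\tilde{w}_i$ and $e_a^{\Sigma}(i) = u_i\Sigma\tilde{w}_{i-1}$.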


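3. The energy relation is useful in several respects. For example, by taking expectations of both
of its sides, expressing $e_a^{\Sigma}(i)$ and $e_p^{\Sigma}(i)$ in terms of the weight-error vector, and by assuming the
regressors $\{u_i\}$ to be independent and identically distributed, we arrive at the following weighted
variance relation:

$$
\begin{aligned}
E\|\tilde{w}_i\|^2_{\Sigma} &= E\|\tilde{w}_{i-1}\|^2_{\Sigma'} + \mu^2\sigma_v^2\, E\left(\frac{\|u_i\|^2_{\Sigma}}{g^2[u_i]}\right)\\
\Sigma' &= \Sigma - \mu\,\Sigma\, E\left(\frac{u_i^*u_i}{g[u_i]}\right) - \mu\, E\left(\frac{u_i^*u_i}{g[u_i]}\right)\Sigma + \mu^2\, E\left(\frac{\|u_i\|^2_{\Sigma}\,u_i^*u_i}{g^2[u_i]}\right)
\end{aligned}
$$

with a weighting matrix $\Sigma'$ for the time instant $i - 1$. The mean of the weight-error vector
also satisfies the recursion

$$
E\tilde{w}_i = \left[I - \mu\, E\left(\frac{u_i^*u_i}{g[u_i]}\right)\right] E\tilde{w}_{i-1}
$$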

4. The expression for $\Sigma'$, and the recursion for $E\tilde{w}_i$, show that studying the transient behavior
of an adaptive filter requires that we evaluate the three moments:

$$
E\left(\frac{\|u_i\|^2_{\Sigma}}{g^2[u_i]}\right), \qquad E\left(\frac{u_i^*u_i}{g[u_i]}\right), \qquad\text{and}\qquad E\left(\frac{\|u_i\|^2_{\Sigma}\,u_i^*u_i}{g^2[u_i]}\right)
$$

These moments, and especially the third one, are in general hard to evaluate for generic functions
$g[\cdot]$. Observe, however, that the third moment appears multiplied by $\mu^2$ in the expression
for $\Sigma'$. Therefore, if desired, its effect can be neglected for small step-sizes. This line of
reasoning is pursued in some of the problems at the end of this part.


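5. In Sec. 25.2 we defined the $M^2 \times M^2$ matrices $F \triangleq I - \mu A + \mu^2 B$,

$$
A \triangleq \left(E\left[\frac{u_i^*u_i}{g[u_i]}\right]^{T} \otimes I\right) + \left(I \otimes E\left[\frac{u_i^*u_i}{g[u_i]}\right]\right),
\qquad
B \triangleq E\left(\left[\frac{u_i^*u_i}{g[u_i]}\right]^{T} \otimes \left[\frac{u_i^*u_i}{g[u_i]}\right]\right)
$$

as well as the $M \times M$ and $2M^2 \times 2M^2$ matrices

$$
P \triangleq E\left(\frac{u_i^*u_i}{g[u_i]}\right)
\qquad\text{and}\qquad
H \triangleq \begin{bmatrix} A/2 & -B/2 \\ I_{M^2} & 0 \end{bmatrix}
$$

Then we showed that the transient behavior of data-normalized adaptive filters is characterized
by the equations:

$$
\begin{aligned}
E\tilde{w}_i &= (I - \mu P)\, E\tilde{w}_{i-1} && \text{(mean behavior)}\\
\mathcal{W}_i &= \mathcal{F}\mathcal{W}_{i-1} + \mu^2\sigma_v^2\mathcal{Y} && \text{(mean-square behavior)}
\end{aligned}
$$

where $\mathcal{F}$ is the companion matrix

$$
\mathcal{F} = \begin{bmatrix}
0 & 1 & & & \\
0 & 0 & 1 & & \\
0 & 0 & 0 & 1 & \\
\vdots & & & \ddots & \\
0 & 0 & 0 & \cdots & 1 \\
-p_0 & -p_1 & -p_2 & \cdots & -p_{M^2-1}
\end{bmatrix} \qquad (M^2 \times M^2)
$$

with

$$
p(x) \triangleq \det(xI - F) = x^{M^2} + \sum_{k=0}^{M^2-1} p_k x^k
$$

denoting the characteristic polynomial of $F$. Also, $\mathcal{W}_i$ is the $M^2 \times 1$ state vector

$$
\mathcal{W}_i \triangleq \text{col}\left\{E\|\tilde{w}_i\|^2_{q},\ E\|\tilde{w}_i\|^2_{Fq},\ E\|\tilde{w}_i\|^2_{F^2q},\ \ldots,\ E\|\tilde{w}_i\|^2_{F^{M^2-1}q}\right\}
$$

$q = \text{vec}(I)$, and the $k$-th entry of $\mathcal{Y}$ is given by

$$
\mathcal{Y}(k) = E\left(\frac{\|u_i\|^2_{F^kq}}{g^2[u_i]}\right), \qquad k = 0, 1, \ldots, M^2-1
$$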


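6. We further showed that a data-normalized filter is stable in the mean and mean-square senses
for step-sizes $\mu$ satisfying

$$
0 < \mu < \min\left\{\frac{2}{\lambda_{\max}(P)},\ \frac{1}{\lambda_{\max}(A^{-1}B)},\ \frac{1}{\max\{\lambda(H) \in \mathbb{R}^+\}}\right\}
$$

The corresponding mean-square deviation and excess mean-square error are given by

$$
\text{MSD} = \mu^2\sigma_v^2\,\text{Tr}(S\Sigma_{\text{msd}}) \quad\text{and}\quad \text{EMSE} = \mu^2\sigma_v^2\,\text{Tr}(S\Sigma_{\text{emse}})
$$

where

$$
S = E\left(\frac{u_i^*u_i}{g^2[u_i]}\right)
$$

and $\{\Sigma_{\text{msd}}, \Sigma_{\text{emse}}\}$ are the weighting matrices that correspond to the vectors

$$
\sigma_{\text{msd}} = (I - F)^{-1}\text{vec}(I) \quad\text{and}\quad \sigma_{\text{emse}} = (I - F)^{-1}\text{vec}(R_u)
$$

That is, $\Sigma_{\text{msd}} = \text{vec}^{-1}(\sigma_{\text{msd}})$ and $\Sigma_{\text{emse}} = \text{vec}^{-1}(\sigma_{\text{emse}})$. All these results hold irrespective
of the distribution of $u_i$ (e.g., $u_i$ is not required to be Gaussian).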


7. In general, it is hard to evaluate the matrices $\{A, B, P\}$ in closed form for arbitrary regressor
distributions and, therefore, it is hard to evaluate the upper bound on $\mu$ for mean-square stability.
However, it is shown in Prob. V.39 that if the distribution of $u_i$ is such that the matrix $B$
is finite, then there always exist small enough step-sizes that guarantee mean-square stability
of data-normalized adaptive filters.


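8. The above results are specialized in the chapters to some well-known algorithms, such as LMS
and $\epsilon$-NLMS.


a) For LMS, we have $g[u_i] = 1$. When the $\{u_i\}$ are Gaussian, it is shown in Thms. 23.1
and 23.2 that LMS is mean-square stable for step-sizes $\mu$ satisfying the conditions stated in
those theorems, with

$$
\text{EMSE} = \frac{\mu\sigma_v^2\displaystyle\sum_{k=1}^{M}\frac{\lambda_k}{2 - s\mu\lambda_k}}{1 - \mu\displaystyle\sum_{k=1}^{M}\frac{\lambda_k}{2 - s\mu\lambda_k}},
\qquad
\text{MSD} = \frac{\mu\sigma_v^2\displaystyle\sum_{k=1}^{M}\frac{1}{2 - s\mu\lambda_k}}{1 - \mu\displaystyle\sum_{k=1}^{M}\frac{\lambda_k}{2 - s\mu\lambda_k}}
$$

where the $\{\lambda_k\}$ are the eigenvalues of $R_u$, $s = 1$ if the regressors are complex-valued, and $s = 2$ if the
regressors are real-valued.


b) For $\epsilon$-NLMS, we have $g[u_i] = \epsilon + \|u_i\|^2$. We state in Sec. 25.1 (and prove in
App. 25.B) that the filter is mean-square stable for any $\mu < 2$, regardless of the distribution
of $u_i$.


c) For Gaussian regressors, we also argue in Secs. 23.1 and 25.1 that the $M^2$-dimensional
state-space model for $\mathcal{W}_i$ reduces to an $M$-dimensional state-space model for both
LMS and $\epsilon$-NLMS.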


9. For adaptive filters with error nonlinearities,

$$
w_i = w_{i-1} + \mu u_i^*g[e(i)], \qquad e(i) = d(i) - u_iw_{i-1}
$$

it is shown in App. 9.C of Sayed (2003) that the variance relation takes the form

$$
E\|\tilde{w}_i\|^2_{\Sigma} = E\|\tilde{w}_{i-1}\|^2_{\Sigma} + \mu^2h_U\,\text{Tr}(R_u\Sigma) - 2\mu\,\text{Re}\left(h_G\, E\|\tilde{w}_{i-1}\|^2_{\Sigma R_u}\right)
$$

where the functions $\{h_U, h_G\}$ are defined by

$$
h_U \triangleq E\,|g[e(i)]|^2, \qquad h_G \triangleq \frac{E\, e_a^*(i)\,g[e(i)]}{E|e_a(i)|^2}
$$

This variance relation is used in Sayed (2003) to characterize the transient behavior of several
adaptive filters with error nonlinearities. It turns out that the transient performance of such
filters is more challenging to study due to the nonlinear nature of the resulting state-space
model.
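The companion-matrix form in item 5 rests on the Cayley-Hamilton theorem: the weighted norms with weights $q, Fq, \ldots, F^{M^2-1}q$ form a closed set because $F^{M^2}q$ is a linear combination of the lower powers, with coefficients given by the characteristic polynomial of $F$. The sketch below illustrates this with a small random matrix as a hypothetical stand-in for $F$; the dimension `n` plays the role of $M^2$.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4                                    # plays the role of M^2
F = rng.standard_normal((n, n)) / 3      # hypothetical stand-in for F = I - mu*A + mu^2*B
q = rng.standard_normal(n)               # weighting vector, e.g. q = vec(I)

# np.poly returns the coefficients of det(xI - F), highest power first:
# [1, p_{n-1}, ..., p_1, p_0].
p = np.poly(F)[::-1][:n]                 # p[k] = p_k for k = 0, ..., n-1

# Companion matrix: ones on the superdiagonal, last row -p_0 ... -p_{n-1}.
F_comp = np.diag(np.ones(n - 1), k=1)
F_comp[-1, :] = -p

# Cayley-Hamilton: F^n q = -(p_0 q + p_1 F q + ... + p_{n-1} F^{n-1} q).
powers = [np.linalg.matrix_power(F, k) @ q for k in range(n + 1)]
lhs = powers[n]
rhs = -sum(p[k] * powers[k] for k in range(n))
print(np.linalg.norm(lhs - rhs))
```

The last row of the companion matrix therefore expresses the weight $F^{M^2}q$ in terms of the weights already present in the state vector, which is what closes the recursion.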


BIBLIOGRAPHIC NOTES 


Energy conservation. The energy-conservation relation (22.12) with weighting is an extension 
of the unweighted relation (15.32), which was derived by Sayed and Rupp (1995) in their studies on 
the robustness and small gain analysis of adaptive filters (see Chapter 44). The extension (22.12), 
along with its variance relations (22.21) and (22.26)-(22.27), were derived by Al-Naffouri and Sayed 
(2001a) and used therein, as well as in subsequent works by the same authors to study the transient 
performance of adaptive filters in a unified manner. The presentation in this part follows mainly 
Al-Naffouri and Sayed (2001ab,2003ab), Sayed and Al-Naffouri (2001), and Sayed, Al-Naffouri 
and Nascimento (2003). The extension of the transient analysis to affine projection algorithms is 
presented in Shin and Sayed (2004) and in Prob. V.37. 


Real vs. complex data. Performance analyses in the literature tend to be carried out separately 
not only for different algorithms, but also for real-valued data as opposed to complex-valued data. 
In our treatment in this book, we have chosen to maintain a uniform treatment and notation for both 
cases of real and complex data. 


Performance results. One of the early works on the transient performance of LMS, for both
stationary and nonstationary environments, is the article by Widrow et al. (1976). Additional early
works on LMS include Jones, Cavin, and Reed (1982) and Feuer and Weinstein (1985). This latter 
reference derived the necessary and sufficient condition for the mean-square stability of LMS for 
Gaussian uncorrelated regression data using the independence assumptions (cf. Thms. 23.1 and 23.2), 
as well as the expressions for the filter EMSE and MSD performance (cf. Thm. 23.3). A related 
analysis, using a different approach than Feuer and Weinstein (1985), has also been advanced by 
Horowitz and Senne (1981). The transient behavior of the NLMS algorithm, on the other hand, has
been studied in a series of works including Bershad (1986b,1987), Tarrab and Feuer (1988), Rupp
(1993), and Slock (1993). The transient behavior of sign-error LMS and sign-regressor LMS has
been studied by Bershad (1986a,1988), Mathews and Cho (1987), and Eweda (1990a). Studies that
involve matrix data nonlinearities, or other forms of data nonlinearities, can be found in Bershad
(1986a), Mikhael (1986), Harris, Chabries, and Bishop (1986), and Douglas and Meng (1994b). In
all these earlier works, it is often assumed that the regression data is white (i.e., uncorrelated) and/or 
Gaussian. It is also generally observed that most works study individual algorithms separately. 

Good treatments of these earlier developments, but mainly for LMS, appear in the textbooks by 
Widrow and Stearns (1985), Bellanger (1987), Solo and Kong (1995), Macchi (1995), Haykin (1996), 
Farhang-Boroujeny (1999), Manolakis, Ingle, and Kogon (2000), Treichler, Johnson, and Larimore 
(2001), and Diniz (2002). Macchi's (1995) book provides one of the most thorough treatments of 
the convergence behavior of LMS for both independent and N-dependent regressors, while Solo and 
Kong's (1995) book relies on averaging theory and small step-size assumptions. 


LMS and independence assumptions. The independence assumptions (22.23) (actually, the 
more detailed assumptions (i)-(vi) of Sec. 16.4) have been applied to the study of LMS since the 
late 1960's by Widrow et al. (1967,1975) for the case of Gaussian variables. The motivation for their 
use was mainly to obtain a tractable mathematical framework. However, it was not until Feuer and 
Weinstein (1985) that a precise study of LMS with independent Gaussian regressors was developed, 
and an improved derivation was given later by Foley and Boland (1988). An analysis for non- 
Gaussian variables was given by Hsia (1983), but the results were not as detailed as in the Gaussian 
case. 


NLMS and independence assumptions. A precise analysis of the NLMS algorithm with Gaus- 
sian independent regressors was given by Tarrab and Feuer (1988). A simplified data model for 
performance analysis was later developed by Slock (1993).


Leaky-LMS and independence assumptions. In Probs. V.28-V.32 we follow Sayed and
Al-Naffouri (2001) and perform a transient analysis of the leaky-LMS filter for both cases of Gaussian
and non-Gaussian regressors. We derive necessary and sufficient conditions for its mean-square 
stability under the independence assumptions. Earlier transient analysis results for leaky-LMS were 
derived by Mayyas and Aboulnasr (1997) for Gaussian regressors; however, the stability conditions 
in this latter reference were only sufficient. 


Affine projection algorithms. The transient behavior of affine projection algorithms is not as 
widely studied as that of LMS or NLMS. The available results have progressed more for some varia- 
tions than others, and most analyses assume particular models for the regression data. For example, 
in Apolinário, Campos, and Diniz (2000) convergence analyses in the mean and in the mean-square 
senses are presented for the binormalized data-reusing LMS (BNDR-LMS) algorithm using a par- 
ticular model for the input signal. Likewise, the convergence results in Sankaran and Beex (2000) 
focus on APA with orthogonal correction factors (APA-OCF) and rely on a special model for the 
input signal vector. In Bershad, Linebarger and McLaughlin (2001), the theoretical results of Rupp 
(1998a) are extended to the evaluation of learning curves assuming a Gaussian autoregressive input 
model. In Shin and Sayed (2004), a treatment of the transient performance of a family of affine 
projection algorithms is provided by relying on the energy-conservation arguments of this chapter. 
Prob. V.37 specializes the transient analysis of this work to e-APA. 


How valid are the independence assumptions? The independence assumptions (22.23) (ac- 
tually, the more detailed assumptions (i)-(vi) of Sec. 16.4) have been applied to the study of LMS 
since the late 1960’s. Although unrealistic for tapped-delay-line implementations, they tend to give 
good results that match real filter performance for small step-sizes. Several studies in the literature 
have struggled with this observation in an attempt to explain it. In App. 24.A we describe some of 
these results. Basically, it is shown in the appendix that if $\mu \approx 0$, the conclusions that are obtained
using the independence assumptions are good approximations for the real performance of LMS. 

One of the earliest results in this regard is that of Mazo (1979), which established Thm. 24.2. The 
arguments in this reference assume that the regression data are generated by feeding a binary i.i.d. 
sequence, say (s(i)), through an FIR filter. Mazo's result then ascertains that for tapped-delay-line 
implementations with such regression data, the conclusions of an independence analysis match the 
exact filter performance for sufficiently small step-sizes. This result was later strengthened by Jones, 
Cavin, and Reed (1982) to the case in which {s(i)} was not restricted to being binary but could be 
any i.i.d. sequence. They established Thm. 24.3. A similar conclusion appears in Macchi (1995, 
p. 114), whereby it is shown that results under slow adaptation are essentially equivalent to those
obtained under independence. 

The above studies assumed (or required) filter convergence. The result in Macchi and Eweda 
(1983) closed the gap by showing that there does exist a small enough step-size that guarantees 
filter convergence even when the regressors are not independent; the authors assumed only that the
regressors are $N$-dependent and established Thm. 24.4 (see also the book by Macchi (1995)).


Averaging analysis. The results described in the previous paragraph on the practicality of the 
independence assumptions for slowly adapting filters are specific to LMS. Another approach for 
studying the performance of LMS without the independence assumptions, as well as for studying the 
performance of a larger class of adaptive filters, is to rely on averaging analysis. This approach is 
reviewed in App. 24.A, where it is seen that it requires the assumption of infinitesimal step-sizes. A 
key result in this framework is Thm. 24.5 (see Solo and Kong (1995, Chapter 9)), which essentially 
states that for small step-sizes, the performance of an adaptive filter can be determined by studying 
the performance of a so-called averaged filter; the averaged filter is such that its analysis may not
require the independence conditions but still requires $\mu \approx 0$. Examples of studies that are motivated by
averaging arguments can be found in Butterweck (1995,2001) and Treichler, Johnson, and Larimore 
(2001, p. 95). For more detailed treatments of averaging analysis see the books by Kushner (1984), 
Benveniste, Métivier, and Priouret (1987), Solo and Kong (1995), and Kushner and Yin (1997). 


ODE method. Another contribution to the analysis of adaptive filters without the independence
assumptions is that of Ljung (1977ab), who developed an approach known as the
ordinary-differential-equation (ODE) method. The idea is to reduce the adaptive filter difference
equation to a differential equation, and to study the convergence and stability properties of the
resulting continuous-time system by using, for example, Lyapunov stability methods. The convenience
of this approach relies on the fact that the stability of differential equations is a well-studied
subject (see, e.g., Bellman (1953)), and many results from this field could therefore be used in the
study of adaptive filters. However, just like the averaging method of App. 24.A, the ODE approach is
only applicable to infinitesimal step-sizes ($\mu \approx 0$). Moreover, in his original work, Ljung (1977ab)
only studied the case of time-variant step-sizes that tend to zero, a condition that hinders the
tracking ability of an adaptive filter in steady-state. Extensions of the ODE method to deal with
constant step-sizes were later developed by Kushner (1984) and Benveniste, Métivier, and Priouret (1987).

Briefly, the basic idea of the ODE method as applied to LMS is the following. Starting from the
LMS update $w_i = w_{i-1} + \mu u_i^*e(i)$, with $e(i) = d(i) - u_iw_{i-1}$, and iterating it over $L$ iterations
we obtain

$$
w_{i+L} = w_{i-1} + \mu\sum_{k=i}^{i+L} u_k^*e(k)
$$

Assuming stationary and ergodic processes, and a sufficiently small step-size (so that the weight
vectors do not vary appreciably over $L$ iterations), the following approximation becomes possible:

$$
\frac{1}{L}\sum_{k=i}^{i+L} u_k^*e(k) \approx E\,u_i^*e(i) = R_{du} - R_uw_{i-1}
$$

where the expectation is computed conditioned on $w_{i-1}$ (i.e., assuming $w_{i-1}$ is fixed). Then

$$
\frac{w_{i+L} - w_{i-1}}{\mu L} \approx \left[R_{du} - R_uw_{i-1}\right]
$$

which, for small $\mu$, suggests a differential equation in the weight vector of the form:

$$
\frac{dw(t)}{dt} = R_{du} - R_uw(t)
$$

t 

The convergence properties of this equation can now be studied using classical results from the theory
of differential equations. For the above example of LMS, this step is trivial since all eigenvalues of
$-R_u$ lie in the open left-half plane and, therefore, the resulting dynamic behavior is stable. For more
general adaptive filters, the corresponding differential equations will be more involved.
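The behavior of the limiting differential equation for LMS can be illustrated by integrating it numerically. The sketch below assumes real data, a randomly generated positive-definite $R_u$, and the relation $R_{du} = R_uw^o$ implied by a linear data model; the forward-Euler scheme and its step `dt` are choices made for this example and are unrelated to the filter step-size $\mu$.

```python
import numpy as np

rng = np.random.default_rng(4)
M = 3
X = rng.standard_normal((M, M))
R_u = X @ X.T + np.eye(M)                  # hypothetical positive-definite covariance R_u
w_o = rng.standard_normal(M)               # hypothetical model vector w^o
R_du = R_u @ w_o                           # R_du = R_u w^o under a linear data model

# Forward-Euler integration of dw/dt = R_du - R_u w(t), starting from w(0) = 0.
dt = 0.5 / np.linalg.eigvalsh(R_u).max()   # keeps the Euler iteration itself stable
w = np.zeros(M)
for _ in range(5000):
    w = w + dt * (R_du - R_u @ w)

print(np.linalg.norm(w - w_o))
```

Since all eigenvalues of $-R_u$ are negative, the trajectory converges to the equilibrium $R_u^{-1}R_{du}$, which is the vector $w^o$ in this setup.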


Which method of analysis is more accurate? There is a lively discussion in the literature 
on which method is more appropriate for the analysis of adaptive filters. Those who argue against 
the use of the independence assumptions state, and rightfully so, that these assumptions do not hold 
for tapped-delay-line implementations. On the other hand, analyses that are based on averaging 
and ODE methods are only valid for infinitesimal step-sizes. When you judge this fact against the 
observation that the independence conditions give good results for small step-sizes, then we are back 
to square one! Thus it seems that if one is interested in studying the performance of adaptive filters 
for sufficiently small step-sizes, then any of these methods should be fine (see the discussion in
App. 24.A). It is truly remarkable that the independence assumptions, being theoretically invalid, still
give good results for small step-sizes. There does not seem to exist a completely satisfactory answer
to this mystery to this date (except for the early investigations mentioned before by Mazo (1979),
Jones, Cavin, and Reed (1982), and Macchi (1995, p. 114)).


Exact performance analysis. The next natural step in the analysis of adaptive filters is to inquire 
whether it is possible to study their performance without having to assume the independence condi- 
tions or even small step-sizes. In this context, the following questions would become relevant and 
they remain largely unanswered in the literature: 
1. How large can the step-size be so that the independence-based approximations are still rea- 
sonable? Also, for a given value of the step-size, what is the order of magnitude of the error 
incurred by using these approximations? 


2. What is the real performance of an adaptive filter when the step-size is not small and the
independence assumptions are not used?


3. How large can the step-size get without compromising filter (mean-square) stability?
4. What is the step-size that gives the fastest convergence rate?


There are very few results in the literature that predict or confirm the behavior/stability of even LMS 
under these conditions. 

One original work that attempted to study the performance of LMS without independence and
slow adaptation assumptions was that of Florian and Feuer (1986). However, as is clear from the title
of their article, the arguments were limited to filters with only two taps (and there is a good reason
for this, as explained below). Their idea was to show that studying the performance of LMS in essence
amounts to studying the stability of a state-space model of the form

$$
x_{k+1} = \Phi(\mu)x_k + z
$$

where $z$ is some constant vector, $x_k$ is a state vector, and $\Phi$ is some constant matrix that depends
on the step-size. With this model, the largest step-size ($\mu_{\max}$) that guarantees stable performance of
LMS is the largest $\mu$ for which $\Phi(\mu)$ is still a stable matrix, i.e.,

$$
\mu_{\max} \triangleq \sup\{\mu \text{ such that } \rho(\Phi(\mu)) < 1\}
$$

where $\rho(\Phi)$ denotes the spectral radius of $\Phi$, i.e., $\rho(\Phi) = \max_i|\lambda_i(\Phi)|$.
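For a matrix $\Phi(\mu)$ that is small enough to handle directly, the definition of $\mu_{\max}$ suggests a simple numerical search. The sketch below uses bisection under the assumption that the stable step-sizes form a single interval $(0, \mu_{\max})$; the function names and the toy choice of $\Phi(\mu)$ are hypothetical and are not the matrix studied by Florian and Feuer (1986).

```python
import numpy as np

def spectral_radius(Phi):
    """rho(Phi) = max_i |lambda_i(Phi)|."""
    return np.abs(np.linalg.eigvals(Phi)).max()

def mu_max_bisection(Phi_of_mu, mu_hi, tol=1e-10):
    """Largest mu in (0, mu_hi) with rho(Phi(mu)) < 1, assuming the stable
    step-sizes form a single interval (0, mu_max) and Phi(mu_hi) is unstable."""
    lo, hi = 0.0, mu_hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if spectral_radius(Phi_of_mu(mid)) < 1:
            lo = mid                       # still stable: move the lower end up
        else:
            hi = mid                       # unstable: move the upper end down
    return lo

# Toy stand-in: Phi(mu) = I - mu*A + mu^2*B with A = 2I and B = I, so that
# Phi(mu) = (1 - mu)^2 I and the stable range is 0 < mu < 2.
A, B = 2 * np.eye(2), np.eye(2)
Phi = lambda mu: np.eye(2) - mu * A + mu**2 * B
mu_max = mu_max_bisection(Phi, mu_hi=5.0)
print(mu_max)
```

Each bisection step requires a full eigenvalue computation, which is why this direct approach is only practical when the dimension of $\Phi$ is modest.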

Unfortunately, determining $\mu_{\max}$ is a highly nontrivial task for two main reasons. First, the
eigenvalues of the matrix $\Phi$ depend nonlinearly on $\mu$ and, secondly, the dimension of $\Phi$ grows
extremely fast with the filter length (e.g., for a filter of order $M = 6$ the matrix has dimensions
$28{,}181 \times 28{,}181$). It is computationally infeasible to work directly with $\Phi$; the approach is feasible
only for relatively small filter lengths. It was for this reason that Florian and Feuer (1986) considered
only the case $M = 2$ (i.e., a filter with two taps), and later Douglas and Pan (1995) used the same
method for orders up to $M = 6$ coupled with a numerical procedure (namely, the power method for
sparse matrices) to evaluate the eigenvalues of $\Phi$.

For longer filters, Nascimento and Sayed (1998) developed an alternative procedure for estimating
a lower bound for $\mu_{\max}$ that does not work directly with the matrix $\Phi$. Their approach was based
on the observation that $\Phi$, although of large dimensions, is both sparse and structured. These two
properties can be exploited to derive a computable bound on the step-size for stable performance.

Another notable existence result was given by Solo (1997). This work provides a bound for 
the step-size that guarantees almost-sure stability of LMS. Unfortunately, however, the bound is not 
computable and, as explained in App. 9.E of Sayed (2003), almost-sure stability does not necessarily 
imply mean-square stability or even reasonable performance. 


Adaptive filters with error nonlinearities. In this part we focused on adaptive filters with data 
nonlinearities. Appendix 9.C of Sayed (2003), on the other hand, considers adaptive filters with error 
nonlinearities. This class of filters is among the most difficult to analyze. For this reason, it has been 
common in the literature to resort to different techniques and assumptions with the intent of allowing 
tractable analyses. Some of the most common approaches in this regard are the following. 


1. The use of linearization as, for example, in Duttweiler (1982), Walach and Widrow (1984), 
Gibson and Gray (1988), Sethares (1992), and Douglas and Meng (1994a). In this method 
of analysis, the error nonlinearity is linearized around an operating point and higher-order 
terms are discarded. Analyses that are based on this technique fail to accurately describe the 
adaptive filter performance at the early stages of adaptation where the error usually assumes 
larger values. 


2. Restricting the class of error nonlinearities as, e.g., in Claasen and Mecklenbräuker (1981), 
Mathews and Cho (1987), Bershad (1988), Tanrikulu and Chambers (1996), and Chambers, 
Tanrikulu, and Constantinides (1994). In this case, the analysis is restricted to particular 
classes of algorithms such as the sign-error LMS algorithm, the least-mean mixed-norm 
(LMMN) algorithm, the least-mean fourth (LMF) algorithm, and error saturation nonlineari- 
ties. By limiting the study to a specific nonlinearity or to a class of nonlinearities, it is possible 
to avoid linearization and the results become more accurate. 


3. Imposing statistical assumptions on the distribution of the error signals as, e.g., in Koike 
(1995) where the elements of the weight-error vector were assumed to be jointly Gaussian 
while studying the sign-error LMS algorithm. This particular assumption was shown to be 
valid asymptotically in Sharma, Sethares, and Bucklew (1996). More accurate is the 
assumption that the residual error is Gaussian as in Duttweiler (1982) and Bershad and Bonnet 
(1990), or that its conditional value is Gaussian as in Mathews and Cho (1987) and Bershad 
(1988). Although it has been argued in Masry and Bullo (1995) that the Gaussian assumption 
on $e_a(i)$ for the sign-error LMS algorithm does not exactly hold under the independence 
assumption, still the Gaussianity assumption on $e_a(i)$ is reasonable for long filters by central 
limit arguments. 


4. Restricting the class of input data, such as assuming white and/or Gaussian regression data 
as done, for example, in Claasen and Mecklenbräuker (1981), Duttweiler (1982), Gardner 
(1984), Gibson and Gray (1988), Rupp (1993), and Douglas and Meng (1994a). 


5. Using the independence assumptions, whereby the successive regressors are assumed to be 
independent of each other as in Mazo (1979). Despite being unrealistic, the independence 
assumptions are among the most heavily used assumptions in adaptive filtering analysis and 
they tend to lead to results that agree with practice for small step-sizes. 


6. Assuming Gaussian white noise. Although Gaussianity is not as common as the whiteness 
assumption, the whiteness condition on the noise allows for more tractable analyses. 


Learning curves. Ensemble-average learning curves are commonly used to analyze and illustrate 
the performance of adaptive filters. They are obtained by averaging several squared-error curves 
over repeated experiments. Such averaged curves have been used extensively in the literature to ex- 
tract information about the rate of convergence of an adaptive filter, its steady-state performance, or 
choices of step-sizes for faster or slower convergence. For infinitesimal step-sizes, or under the inde- 
pendence assumptions (22.23), it is known that information extracted from such ensemble-average 
learning curves provides reasonably accurate information about the real performance of an adaptive 
filter (see, e.g., the works by Widrow et al. (1976), Macchi and Eweda (1983), Bershad (1986b), 
Macchi (1995), Solo and Kong (1995), and Kushner and Vázquez-Abad (1996)). 

Appendix 9.E of Sayed (2003), however, examines the performance of adaptive schemes for 
larger step-sizes. By larger step-sizes it is not meant step-sizes that are necessarily large, but rather 
step-sizes that are not infinitesimally small. The presentation in the appendix follows the work by 
Nascimento and Sayed (2000). In the process of comparing results obtained from ensemble-average 
learning curves with results predicted by an exact theoretical analysis, the authors observed some 
interesting differences between theory and practice. As shown in their article and in the appendix 
(see also Sayed and Nascimento (2003)), the differences in behavior can be explained by examining 
jointly the mean-square and almost-sure convergence of an adaptive filter. Both forms of convergence 
tend to agree for sufficiently small step-sizes, but they behave differently at larger step-sizes; thus 
leading to the occurrence of interesting phenomena in the learning behavior of an adaptive filter. 


Almost-sure stability of filters. Prior to Nascimento and Sayed (2000), there have been several 
works in the literature on the almost-sure stability of LMS, e.g., by Bitmead and Anderson (1981), 
Bitmead, Anderson, and Ng (1986), and Solo (1997), or even more generally on the almost-sure 
stability of continuous-time systems, e.g., by Kozin (1969) and Parthasarathy and Evan-Iwanowski 
(1978). These earlier works focused mainly on infinitesimal step-sizes for which the distinctions be- 
tween mean-square stability and almost-sure stability do not manifest themselves. The earlier works 
were also not interested in how the mean-square and almost-sure convergence performance of the 
filters differed for larger step-sizes. For example, Bitmead, Anderson, and Ng (1986) only compared 
both notions of stability for $\mu \approx 0$, when they in fact agree. Moreover, some of these earlier 
investigations may suggest that almost-sure stability implies reasonable algorithm performance (Solo 
(1997)). The material in App. 9.E of Sayed (2003) shows that this is not necessarily the case; an 
almost-sure stable filter might have poor performance when it is not also mean-square stable. This is 
because there is a small time interval during which the ensemble-average learning curve tends to stay 
reasonably close to the (mean-square) theoretical learning curve. In this way, an almost-sure stable, 
but mean-square unstable, algorithm would likely have its error diverging to a large value before 
starting to converge. 


PROBLEMS 


Problem V.1 (Weighted norms) Introduce the notation $\|x\|_\Sigma^2 = x^*\Sigma x$, where $\Sigma$ is a Hermitian 
and non-negative definite matrix. 


(a) Show that $\|x\|_\Sigma = \sqrt{x^*\Sigma x}$ is a valid vector norm when $\Sigma$ is positive-definite, i.e., verify that 
it satisfies all the properties of vector norms as described in Chapter B. 


(b) When $\Sigma$ is singular, which properties of vector norms are violated? 


Problem V.2 (Positive-definiteness of the weighting matrix) Refer to expression (22.20) for 
$\boldsymbol{\Sigma}'$ and assume $\Sigma > 0$. 


(a) Verify that 

$$\boldsymbol{\Sigma}' = \left(I - \mu\,\frac{u_i^* u_i}{g[u_i]}\right) \Sigma \left(I - \mu\,\frac{u_i^* u_i}{g[u_i]}\right)$$

Conclude that $\boldsymbol{\Sigma}' \geq 0$. 


(b) Now let $\Sigma' = \mathsf{E}\,\boldsymbol{\Sigma}'$. Clearly, $\Sigma' \geq 0$ in view of part (a). We wish to show that it is actually 
positive-definite. So assume, to the contrary, that there exists a nonzero constant vector $x$ such 
that $\Sigma' x = 0$. Define 

$$y \triangleq \left(I - \mu\,\frac{u_i^* u_i}{g[u_i]}\right) x$$

Show from $\Sigma' x = 0$ that $y = 0$ almost surely. Argue that this is only possible if the regressor 
$u_i$ is non-random (i.e., a constant vector). 


Problem V.3 (Special matrix) Consider a nonnegative-definite matrix $F$ of the form $F = I - \mu A + \mu^2 B$, 
for some positive-definite matrix $A$, nonnegative-definite matrix $B$ (assumed nonzero), 
and a positive scalar $\mu$. We would like to determine a bound on $\mu$ such that the eigenvalues of $F$ are 
less than unity. 


(a) Argue that the required condition on the eigenvalues of $F$ is equivalent to $A - \mu B > 0$. 


(b) Let $A = U_A \Lambda_A U_A^*$ denote the eigen-decomposition of $A$, where $U_A$ is unitary and $\Lambda_A$ is 
diagonal with positive entries. Argue that the matrices $A - \mu B$ and $I - \mu\,\Lambda_A^{-1/2} U_A^* B U_A \Lambda_A^{-1/2}$ 
are congruent, where $\Lambda_A^{1/2}$ denotes a diagonal matrix with the positive square-roots of the 
eigenvalues of $A$. Conclude that the condition on $\mu$ is 

$$\mu < \frac{1}{\lambda_{\max}\left(\Lambda_A^{-1/2} U_A^* B U_A \Lambda_A^{-1/2}\right)}$$


(c) Argue that the matrices $\Lambda_A^{-1/2} U_A^* B U_A \Lambda_A^{-1/2}$ and $A^{-1}B$ are similar. Conclude that all the 
eigenvalues of $A^{-1}B$ are nonnegative, and that an equivalent characterization for the condition 
on $\mu$ is $\mu < 1/\lambda_{\max}(A^{-1}B)$. Conclude further that if $B$ is positive-definite then the 
eigenvalues of $A^{-1}B$ are positive and real. 


(d) Show further that the upper bound $1/\lambda_{\max}(A^{-1}B)$ is equal to the smallest $\eta$ that results in a 
singular matrix $A - \eta B$. 
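
The bound of Prob. V.3 can be sanity-checked numerically before attempting a proof. The following is only a sketch: the matrices `A` and `B` are randomly generated stand-ins (not from the text), and the check looks only at the upper end of the spectrum of $F$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
# Hypothetical test matrices: A positive-definite, B nonnegative-definite and nonzero.
X = rng.standard_normal((n, n))
A = X @ X.T + n * np.eye(n)
Y = rng.standard_normal((n, 2))
B = Y @ Y.T  # rank 2, hence singular but nonzero

# Bound from parts (b)-(c): mu < 1 / lambda_max(A^{-1} B).
eigs = np.linalg.eigvals(np.linalg.solve(A, B)).real
mu_bound = 1.0 / eigs.max()

def max_eig_F(mu):
    """Largest eigenvalue of F = I - mu*A + mu^2*B (F is symmetric here)."""
    F = np.eye(n) - mu * A + mu**2 * B
    return np.linalg.eigvalsh(F).max()

below = max_eig_F(0.9 * mu_bound)  # step-size inside the bound
above = max_eig_F(1.1 * mu_bound)  # step-size outside the bound

# Part (d): A - eta*B should lose rank at eta = mu_bound.
smallest_sv = np.linalg.svd(A - mu_bound * B, compute_uv=False).min()
print(mu_bound, below, above, smallest_sv)
```

The largest eigenvalue of $F$ should fall below one for the first step-size and exceed one for the second, while the smallest singular value of $A - \eta B$ at $\eta = 1/\lambda_{\max}(A^{-1}B)$ should be zero up to rounding.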


Problem V.4 (Indefinite coefficient matrix) Consider the matrix $F$ defined by (24.10) and 
assume all data are real-valued. Assume also $M = 2$ and generate the regressors as follows: 


$$u_i = \begin{cases} [\,1 \;\; 0\,] & \text{with probability } 1/2 \\ [\,0 \;\; 1/\sqrt{2}\,] & \text{with probability } 1/2 \end{cases}$$

(a) Show that in this case, $F = \mathrm{diag}\{1 - \mu + \mu^2/2,\; 1 - 3\mu/4,\; 1 - 3\mu/4,\; 1 - \mu/2 + \mu^2/8\}$. 


(b) Verify that $F$ can become indefinite. Do the eigenvalues of $F$ cross first $+1$ or $-1$ as $\mu$ 
increases? 
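
Part (a) of Prob. V.4 can be checked by forming the moments as finite sums. The sketch below assumes that, for real data, (24.10) takes the form $F = I - \mu(I \otimes R_u) - \mu(R_u \otimes I) + \mu^2\,\mathsf{E}(u_i^{\mathsf T}u_i \otimes u_i^{\mathsf T}u_i)$; that form is not restated in this section, so treat it as an assumption.

```python
import numpy as np

# The two equally likely regressors (row vectors) of Prob. V.4.
regressors = [np.array([[1.0, 0.0]]), np.array([[0.0, 1.0 / np.sqrt(2.0)]])]
probs = [0.5, 0.5]

I2 = np.eye(2)
# R_u = E[u^T u] and K = E[(u^T u) kron (u^T u)], computed exactly as finite sums.
Ru = sum(p * (u.T @ u) for p, u in zip(probs, regressors))
K = sum(p * np.kron(u.T @ u, u.T @ u) for p, u in zip(probs, regressors))

def F_matrix(mu):
    """Assumed real-data form: F = I - mu*(I kron Ru) - mu*(Ru kron I) + mu^2*K."""
    return np.eye(4) - mu * np.kron(I2, Ru) - mu * np.kron(Ru, I2) + mu**2 * K

mu0 = 0.3
F = F_matrix(mu0)
F_claimed = np.diag([1 - mu0 + mu0**2 / 2, 1 - 3 * mu0 / 4,
                     1 - 3 * mu0 / 4, 1 - mu0 / 2 + mu0**2 / 8])
print(np.allclose(F, F_claimed))

# Part (b): scan a few step-sizes and inspect the signs of the eigenvalues.
for mu in (1.0, 1.5, 2.0):
    print(mu, np.linalg.eigvalsh(F_matrix(mu)))
```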


Problem V.5 (MSE of LMS) Consider the setting of Thm. 23.3, and let c denote the constant 


Verify that the mean-square error (MSE) of LMS, which is the steady-state variance of $e(i)$, is given 
by $\mathsf{MSE} = \sigma_v^2/(1 - \mu c)$. 


Problem V.6 (A positive-definite covariance matrix) Consider regressors $\{u_i\}$ with a 
positive-definite covariance matrix $R_u$. Define the modified regressor $\check{u}_i = u_i/\sqrt{\epsilon + \|u_i\|^2}$, and let $R_{\check{u}}$ 
denote its covariance matrix. 


(a) Prove that $R_{\check{u}}$ is positive-definite. 


(b) Use the properties of Kronecker products from Lemma B.8 to show that for any two $n \times n$ 
matrices $\{D, E\}$, the eigenvalues of $(I \otimes D) + (E \otimes I)$ are all the $n^2$ combinations $\{\lambda_j(D) + 
\lambda_k(E)\}$ for all $1 \leq j, k \leq n$. 


(c) Now consider the matrix $A$ defined by (25.4). Show that $A > 0$. 
(d) Likewise, show that the matrix $B$ in (25.5) is nonnegative definite, $B \geq 0$. 
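
The eigenvalue property in part (b) of Prob. V.6 is easy to observe numerically. The sketch below restricts attention to symmetric test matrices (a simplification not required by the problem) so that all eigenvalues are real and can be compared after sorting.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 3
# Symmetric test matrices, so that all eigenvalues are real and easy to sort.
D = rng.standard_normal((n, n))
D = D + D.T
E = rng.standard_normal((n, n))
E = E + E.T

I = np.eye(n)
kron_sum = np.kron(I, D) + np.kron(E, I)

# Eigenvalues of the Kronecker sum versus all pairwise sums lambda_j(D) + lambda_k(E).
lhs = np.sort(np.linalg.eigvalsh(kron_sum))
lam_D = np.linalg.eigvalsh(D)
lam_E = np.linalg.eigvalsh(E)
rhs = np.sort((lam_D[:, None] + lam_E[None, :]).ravel())
print(np.allclose(lhs, rhs))
```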


Problem V.7 (Transient behavior of LMS) Start from (24.11) and show that 


$$\mathsf{E}\,\|\tilde{w}_i\|^2 = \mathsf{E}\,\|\tilde{w}_{i-1}\|^2 + \mu^2\sigma_v^2\,(r^{\mathsf T} F^i q) - \|w^o\|^2_{F^i(I-F)q}$$


where $\{F, r, q\}$ are as defined in Sec. 24.1. 


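Problem V.8 (Kronecker matrix inequality) Assume $Z$ is a rank-one $M \times M$ matrix. Denote 
its eigenvalues by $\{\lambda, 0, 0, \ldots, 0\}$ and assume $0 < \lambda < 1$. Introduce the eigen-decomposition 
$Z = V\Lambda V^*$ with $V$ unitary and $\Lambda = \mathrm{diag}\{\lambda, 0, \ldots, 0\}$. 


(a) Use property (i) from Lemma B.8 to verify that 

$$Z^{\mathsf T} \otimes Z = (V^{\mathsf T} \otimes V^*)^*\,(\Lambda \otimes \Lambda)\,(V^{\mathsf T} \otimes V^*), \qquad Z^{\mathsf T} \otimes I = (V^{\mathsf T} \otimes V^*)^*\,(\Lambda \otimes I)\,(V^{\mathsf T} \otimes V^*)$$

(b) Let $C = (V^{\mathsf T} \otimes V^*)$ and define $y = Cx$ for any nonzero vector $x$. Denote the entries of $y$ by 
$\{y(i),\ i = 1, \ldots, M^2\}$. Verify that 

$$x^*(Z^{\mathsf T} \otimes Z)\,x = \lambda^2\,|y(1)|^2, \qquad x^*(Z^{\mathsf T} \otimes I)\,x = \lambda \sum_{i=1}^{M} |y(i)|^2$$

and conclude that $(Z^{\mathsf T} \otimes Z) \leq (Z^{\mathsf T} \otimes I)$. Likewise, show that $(Z^{\mathsf T} \otimes Z) \leq (I \otimes Z)$. Conclude 
that $2(Z^{\mathsf T} \otimes Z) \leq (Z^{\mathsf T} \otimes I) + (I \otimes Z)$. 

(c) Inequality (25.31) is obviously true when $u_i = 0$. When $u_i \neq 0$, choose $Z = u_i^* u_i/(\epsilon + 
\|u_i\|^2)$ and use the result of part (b) to establish the validity of (25.31). 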
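
The inequality of part (b) of Prob. V.8 can be illustrated for one sample regressor. This is a sketch under assumed values: the regressor `u` and the constant `eps` are arbitrary choices, and the check simply inspects the smallest eigenvalue of the difference between the two sides.

```python
import numpy as np

rng = np.random.default_rng(2)
M, eps = 3, 0.1
# Hypothetical complex regressor (row vector) and the rank-one matrix Z of part (c).
u = rng.standard_normal((1, M)) + 1j * rng.standard_normal((1, M))
Z = (u.conj().T @ u) / (eps + np.linalg.norm(u) ** 2)

I = np.eye(M)
lhs = 2 * np.kron(Z.T, Z)
rhs = np.kron(Z.T, I) + np.kron(I, Z)

# rhs - lhs should be Hermitian and nonnegative-definite.
gap = rhs - lhs
min_eig = np.linalg.eigvalsh(gap).min()
print(min_eig)
```

A smallest eigenvalue that is nonnegative up to rounding is consistent with $2(Z^{\mathsf T} \otimes Z) \leq (Z^{\mathsf T} \otimes I) + (I \otimes Z)$.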


Problem V.9 (Learning curves) Consider the general setting of Thm. 25.2, which deals with 
data-normalized adaptive filters. The purpose of this problem is to use the variance relation (25.14) 
to characterize not only the time evolution of $\mathsf{E}\,\|\tilde{w}_i\|^2$, as was done in the statement of the theorem, but 
also the time evolution of $\mathsf{E}\,|e_a(i)|^2$, which relates to the learning curve of the filter. 

Recall that the learning curve of an adaptive filter is defined by $\mathsf{E}\,|e(i)|^2$, which in view of (22.16) 
is given by $\mathsf{E}\,|e(i)|^2 = \mathsf{E}\,|e_a(i)|^2 + \sigma_v^2$, so that the learning curve can be equivalently characterized 
by studying the evolution of $\mathsf{E}\,|e_a(i)|^2$. Under the conditions on the data stated in Thm. 25.2, we 
know from (23.55) that $\mathsf{E}\,|e_a(i)|^2 = \mathsf{E}\,\|\tilde{w}_{i-1}\|^2_{R_u}$. This means that the learning curve can be 
determined by studying the evolution of the above weighted norm of $\tilde{w}_{i-1}$. This evolution can be 
deduced from the variance relation (25.14), namely, 


$$\mathsf{E}\,\|\tilde{w}_i\|^2_{\sigma} = \mathsf{E}\,\|\tilde{w}_{i-1}\|^2_{F\sigma} + \mu^2\sigma_v^2\,\mathsf{E}\left(\frac{\|u_i\|^2_{\sigma}}{g^2[u_i]}\right)$$

by properly choosing the weighting factor $\sigma$. 


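(a) Choose $\sigma = \mathrm{vec}\{R_u\} = r$ (i.e., choose the weighting matrix as $\Sigma = R_u$). Use the weighting 
matrix recursion (25.18), and iterate the above variance relation, to deduce that 


$$\mathsf{E}\,\|\tilde{w}_i\|^2_{r} = \mathsf{E}\,\|\tilde{w}_{-1}\|^2_{F^{i+1}r} + \mu^2\sigma_v^2\,\mathsf{E}\left(\frac{\|u_i\|^2_{(I + F + \cdots + F^i)r}}{g^2[u_i]}\right)$$

where $\tilde{w}_{-1} = w^o$, assuming $w_{-1} = 0$. 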

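(b) Verify that the expression for $\mathsf{E}\,\|\tilde{w}_i\|^2_r$ from part (a) can be rewritten as $\mathsf{E}\,\|\tilde{w}_i\|^2_r = \|w^o\|^2_{a_i} + 
\mu^2\sigma_v^2\,b(i)$, where the vector $a_i$ and the scalar $b(i)$ satisfy the recursions 


$$a_i = F a_{i-1}, \qquad b(i) = b(i-1) + \mathsf{E}\left(\frac{\|u_i\|^2_{a_{i-1}}}{g^2[u_i]}\right), \qquad a_{-1} = r, \quad b(-1) = 0$$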


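(c) Conclude that the following recursion holds for $\mathsf{E}\,|e_a(i)|^2$: 


$$\mathsf{E}\,|e_a(i)|^2 = \mathsf{E}\,|e_a(i-1)|^2 - \|w^o\|^2_{F^{i-1}(I-F)r} + \mu^2\sigma_v^2\,\mathsf{E}\left(\frac{\|u_i\|^2_{F^{i-1}r}}{g^2[u_i]}\right)$$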


(d) Use the recursion of part (c) to re-derive the expression for EMSE of Thm. 25.2. 


(e) Specialize the result of part (c) to the LMS algorithm assuming circular Gaussian regressors, 
and re-derive the expression for the EMSE of LMS from Thm. 23.3. 


Problem V.10 (Learning curve of LMS) Assume circular Gaussian regressors $\{u_i\}$. Start from 
(23.19) and repeat the argument of Prob. V.9. Use the resulting learning curve to re-establish the 
expression for the EMSE of LMS from Thm. 23.3. 


Problem V.11 (Non-Gaussian data) When the regression data are not Gaussian, it may still be 
possible to evaluate the last moment in (24.1) explicitly, in which case one can rely on the techniques 
of Chapter 23 as opposed to Chapter 24. Thus consider an adaptive filter with a 2-dimensional 
regression vector $u_i = [\,u_i(1) \;\; u_i(2)\,]$. At every time instant $i$, the entries $\{u_i(1), u_i(2)\}$ are 
chosen independently of each other as follows: 

$$u_i(1) = \begin{cases} 2a & \text{with probability } 1/3 \\ -a & \text{with probability } 2/3 \end{cases} \qquad\quad u_i(2) = \begin{cases} -b & \text{with probability } 3/4 \\ 3b & \text{with probability } 1/4 \end{cases}$$

where $a$ and $b$ are positive real numbers. Moreover, the successive regression vectors $\{u_i\}$ are 
independent of each other. 


(a) Find the variances of $u_i(1)$ and $u_i(2)$ and the covariance matrix of $u_i$. 
(b) Characterize the learning curve of the LMS filter. 


(c) Find exact expressions for the EMSE and MSD that would result when the filter is trained 
using LMS. 


(d) Find conditions on the step-size to ensure stability in the mean and mean-square senses. 


(e) How would the results of parts (c) and (d) change if $u_i(1)$ and $u_i(2)$ were replaced by 
independent zero-mean Gaussian random variables with the same variances? 


Problem V.12 (Combination of adaptive filters) Refer to Prob. IV.4. 


(a) Find exact conditions on the step-sizes $\mu_1$ and $\mu_2$ in order to guarantee the mean-square 
convergence of the combined adaptive structure. 


(b) Find exact expressions for the MSDs of the two adaptive filters, $w_{1,i}$ and $w_{2,i}$. 
(c) Find an expression describing the time evolution of the learning curve $\mathsf{E}\,|e(i)|^2$, where $e(i) = 
d(i) - y(i)$. 


Problem V.13 (Diffusion strategy) Consider two nodes at spatial locations 1 and 2. At each 
time instant $i$, each node $k = 1, 2$ has access to data $\{d^{(k)}(i), u_i^{(k)}\}$ satisfying the standard data model 


$$d^{(k)}(i) = u_i^{(k)} w^o + v^{(k)}(i), \qquad k = 1, 2$$


for the same unknown vector $w^o$. The zero-mean regression sequences $\{u_i^{(k)}\}$ are independent 
over both time and space (i.e., over $i$ and $k$) with covariance matrices $\{R_u^{(k)}\}$. Likewise, the noise 
sequences $\{v^{(k)}(i)\}$ are both temporally and spatially white with variances $\{\sigma_{v,k}^2\}$. As indicated in 
Fig. V.1, each node $k$ runs an LMS type filter that uses both temporal and spatial data as follows: 


spatial information: 

$$\phi_{i-1}^{(1)} = \alpha\,w_{i-1}^{(1)} + (1-\alpha)\,w_{i-1}^{(2)}$$
$$\phi_{i-1}^{(2)} = \beta\,w_{i-1}^{(2)} + (1-\beta)\,w_{i-1}^{(1)}$$


temporal information: 

$$w_i^{(1)} = \phi_{i-1}^{(1)} + \mu\,u_i^{(1)*}\left[d^{(1)}(i) - u_i^{(1)}\phi_{i-1}^{(1)}\right]$$
$$w_i^{(2)} = \phi_{i-1}^{(2)} + \mu\,u_i^{(2)*}\left[d^{(2)}(i) - u_i^{(2)}\phi_{i-1}^{(2)}\right]$$

That is, each node first combines the existing weight estimates at both nodes to generate the 
intermediate estimates $\{\phi_{i-1}^{(1)}, \phi_{i-1}^{(2)}\}$. The combination uses coefficients $\{0 \leq \alpha, \beta \leq 1\}$. Subsequently, the 
intermediate vectors are updated to $\{w_i^{(1)}, w_i^{(2)}\}$, and the process continues. 


FIGURE V.1 Two adaptive nodes sharing temporal and spatial information. 


(a) Find a condition on the step-size $\mu$ such that the mean behavior of the weight-error vectors of 
both adaptive filters is stable. 
(b) How would the condition on $\mu$ change if $\alpha = 1$ and $\beta = 1$, i.e., if the nodes do not share 
spatial information? 
(c) Assume $\alpha = \beta = 1/2$. Argue that for the same step-size, the means of the weight-error 
vectors will generally converge faster to zero than when there is no spatial cooperation among 
the nodes (as in part (b)). 
Remark. This example is a special case of more general distributed adaptive filters derived by Lopes and Sayed 
(2006, 2007a,b, 2008), Sayed and Lopes (2007), and Cattivelli, Lopes, and Sayed (2008). The distributed schemes 
incorporate various forms of spatial cooperation among nodes. The term diffusion that appears in the title of 
this problem refers generally to the manner by which information is shared or diffused among the nodes in a 
neighborhood. 
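
The effect of spatial cooperation on the mean behavior in Prob. V.13 can be explored numerically. The sketch below assumes that, under the independence conditions of the problem, the stacked mean weight-error vector evolves through a matrix of the form $\mathrm{blockdiag}\{I - \mu R_u^{(1)},\, I - \mu R_u^{(2)}\}\,(C \otimes I)$, where $C$ collects the combination coefficients; this form, the covariance matrices, and the step-size below are illustrative assumptions rather than results quoted from the text.

```python
import numpy as np

def mean_matrix(mu, R1, R2, alpha, beta):
    """Coefficient matrix of the stacked mean weight-error recursion (assumed form)."""
    M = R1.shape[0]
    I = np.eye(M)
    # Combination step: [phi1; phi2] = (C kron I) [w1; w2].
    C = np.array([[alpha, 1 - alpha], [1 - beta, beta]])
    # Adaptation step: the mean error at node k is scaled by (I - mu * R_k).
    D = np.block([[I - mu * R1, np.zeros((M, M))],
                  [np.zeros((M, M)), I - mu * R2]])
    return D @ np.kron(C, I)

def spectral_radius(X):
    return np.abs(np.linalg.eigvals(X)).max()

# Hypothetical covariance matrices and step-size.
R1 = np.diag([1.0, 0.2])
R2 = np.diag([0.3, 0.8])
mu = 0.5

rho_no_coop = spectral_radius(mean_matrix(mu, R1, R2, 1.0, 1.0))
rho_coop = spectral_radius(mean_matrix(mu, R1, R2, 0.5, 0.5))
print(rho_no_coop, rho_coop)
```

A smaller spectral radius corresponds to faster decay of the mean weight-error vectors, which is the comparison asked for in part (c).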


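Problem V.14 (Performance of LMS with small step-size) Consider the discussion 
following Thm. 24.1, namely, that for a sufficiently small step-size, the variance relation (24.2) can be 
approximated by 

$$\mathsf{E}\,\|\tilde{w}_i\|^2_{\Sigma} = \mathsf{E}\,\|\tilde{w}_{i-1}\|^2_{\Sigma'} + \mu^2\sigma_v^2\,\mathsf{E}\,\|u_i\|^2_{\Sigma}$$
$$\Sigma' = \Sigma - \mu\,\Sigma\,\mathsf{E}\,[u_i^* u_i] - \mu\,\mathsf{E}\,[u_i^* u_i]\,\Sigma$$

Introduce the eigen-decomposition $R_u = U\Lambda U^*$, where $\Lambda$ is diagonal with the eigenvalues of $R_u$ 
and $U$ is a unitary matrix. Introduce further the transformed quantities (22.31). 


(a) Verify that the above variance relation becomes 


$$\mathsf{E}\,\|\bar{w}_i\|^2_{\bar\Sigma} = \mathsf{E}\,\|\bar{w}_{i-1}\|^2_{\bar\Sigma'} + \mu^2\sigma_v^2\,\mathsf{E}\,\|\bar{u}_i\|^2_{\bar\Sigma}$$
$$\bar\Sigma' = \bar\Sigma - \mu\,\bar\Sigma\Lambda - \mu\,\Lambda\bar\Sigma$$


Conclude that $\bar\Sigma'$ is diagonal if $\bar\Sigma$ is. 

(b) Let $\bar\sigma = \mathrm{diag}\{\bar\Sigma\}$ and $\lambda = \mathrm{diag}\{\Lambda\}$. Verify that the expression for $\bar\Sigma'$ reduces to $\bar\sigma' = \bar{F}\bar\sigma$, 
where $\bar{F} = I - 2\mu\Lambda$. 

(c) Follow the arguments that led to Thm. 23.3 and show that $\mathsf{EMSE} = \mu\sigma_v^2\,\mathrm{Tr}(R_u)/2$ and 
$\mathsf{MSD} = \mu\sigma_v^2 M/2$. Compare the expression for EMSE with that in Lemma 16.1. 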


Problem V.15 (Numerical example) Consider the setting of Thm. 24.1 and assume all variables 
are real-valued. Set $M = 2$, i.e., the regressors are two-dimensional, say $u_i = [\,u_i(1) \;\; u_i(2)\,]$, 
and assume the entries $\{u_i(k)\}$ are i.i.d. and uniformly distributed between $-1$ and $1$. 

(a) Verify that $R_u = I_2/3$, $(I_2 \otimes R_u) = (R_u \otimes I_2) = I_4/3$, and 


$$\mathsf{E}\left(u_i^{\mathsf T} u_i \otimes u_i^{\mathsf T} u_i\right) = \begin{bmatrix} 1/5 & 0 & 0 & 1/9 \\ 0 & 1/9 & 1/9 & 0 \\ 0 & 1/9 & 1/9 & 0 \\ 1/9 & 0 & 0 & 1/5 \end{bmatrix}$$


(b) Evaluate $\mathsf{E}\,\|\tilde{w}_\infty\|^2$ for the choices $\mu = 0.01$ and $\mu = 0.5$. For what range of step-sizes is the 
filter mean-square stable? 
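
The moment matrix in part (a) of Prob. V.15 can be assembled directly from the scalar moments of a uniform variable, and the same matrices can then be used to probe stability. The sketch below assumes the real-data form $F = I - \mu(I \otimes R_u) - \mu(R_u \otimes I) + \mu^2\,\mathsf{E}(u_i^{\mathsf T}u_i \otimes u_i^{\mathsf T}u_i)$ for the coefficient matrix of Thm. 24.1, which is not restated in this section.

```python
import numpy as np
from itertools import product

def moment(p):
    """E[u^p] for a scalar u uniformly distributed on [-1, 1]."""
    return 0.0 if p % 2 else 1.0 / (p + 1)

M = 2
Ru = moment(2) * np.eye(M)

# K = E[(u^T u) kron (u^T u)]; the entry in row (a, c), column (b, d) is E[u_a u_b u_c u_d].
K = np.zeros((M * M, M * M))
for a, b, c, d in product(range(M), repeat=4):
    powers = [0] * M
    for idx in (a, b, c, d):
        powers[idx] += 1
    # Independent entries: the joint moment factors into scalar moments.
    K[a * M + c, b * M + d] = np.prod([moment(p) for p in powers])

def F_matrix(mu):
    """Assumed real-data form: F = I - mu*(I kron Ru) - mu*(Ru kron I) + mu^2*K."""
    I = np.eye(M)
    return np.eye(M * M) - mu * (np.kron(I, Ru) + np.kron(Ru, I)) + mu**2 * K

for mu in (0.01, 0.5):
    rho = np.abs(np.linalg.eigvals(F_matrix(mu))).max()
    print(mu, rho)
```

Scanning `mu` over a finer grid and recording where the spectral radius first reaches one gives a numerical estimate to compare against the analytical answer to part (b).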


Problem V.16 (Diagonal moments) Consider row vectors $\bar{u}_i$ that are zero-mean circular 
Gaussian with a diagonal covariance matrix $\Lambda$. Define the matrices 


— 27"327 T 7. 2 
A SE uiui | E | uiui | N B Se [us ur 
[zt] * 5 [eri (e+ pra] t 
(a) Verify that the $(j,k)$-th element of $A$ is given by 


ECTS E123 Ws Ed | 


where $\bar{u}_i(j)$ denotes the $j$-th entry of $\bar{u}_i$. For $j \neq k$, argue that the first term in the above 
expression for $A_{jk}$ is an odd function of $\bar{u}_i(k)$, while the second term is an odd function of 
$\bar{u}_i(j)$. Conclude that $A_{jk} = 0$ for $j \neq k$. 


(b) Assume $\bar\Sigma$ is diagonal. Use a similar argument to establish that $B'$ is diagonal. Verify that the 
$k$-th diagonal element of $B'$ can be expressed as 


jalg 


= 2 
d = MWE a = NE ig. o (at ys 
cd * [cnr] = e [eis te js 


where the notation $\odot$ stands for the Hadamard (or elementwise) product of two vectors. 


(c) Let $b'$ denote a vector with the diagonal entries of $B'$, i.e., $b' = \mathrm{diag}\{B'\}$. Verify that $b'$ can 
be expressed as $b' = B\bar\sigma$ where $B$ is the positive-definite matrix 


o | oa] u o T] _ uu] o [ua] 
s=e| ) | -el | 


(e + |[tsl|?)? (e+ |ui? 


and $\bar\sigma$ is the vector containing the diagonal entries of $\bar\Sigma$. 


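Problem V.17 (Gaussian regressors and $\epsilon$-NLMS) Refer to the discussion in Sec. 25.1 on the 
transient performance of $\epsilon$-NLMS and, in particular, to Thm. 25.1. The purpose of this problem is to 
show that the $M^2$-dimensional problem stated in the theorem reduces to an $M$-dimensional model 
when the regressors are Gaussian. 

Thus assume the $\{u_i\}$ are circular Gaussian so that the transformed regressors $\bar{u}_i$ defined by 
(22.31) are also circular Gaussian. Since the covariance matrix of $\bar{u}_i$ is diagonal, and equal to 
$\Lambda$, the individual entries of $\bar{u}_i$ will be independent of each other. With Gaussian regressors, it is 
convenient to work with the transformed versions (22.33)-(22.35) and (22.36) of the variance and 
mean relations, which in the $\epsilon$-NLMS case are given by 


$$\mathsf{E}\,\|\bar{w}_i\|^2_{\bar\Sigma} = \mathsf{E}\,\|\bar{w}_{i-1}\|^2_{\bar\Sigma'} + \mu^2\sigma_v^2\,\mathsf{E}\left(\frac{\|\bar{u}_i\|^2_{\bar\Sigma}}{(\epsilon + \|\bar{u}_i\|^2)^2}\right)$$

$$\bar\Sigma' = \bar\Sigma - \mu\,\bar\Sigma\,\mathsf{E}\left(\frac{\bar{u}_i^*\bar{u}_i}{\epsilon + \|\bar{u}_i\|^2}\right) - \mu\,\mathsf{E}\left(\frac{\bar{u}_i^*\bar{u}_i}{\epsilon + \|\bar{u}_i\|^2}\right)\bar\Sigma + \mu^2\,\mathsf{E}\left(\frac{\|\bar{u}_i\|^2_{\bar\Sigma}\,\bar{u}_i^*\bar{u}_i}{(\epsilon + \|\bar{u}_i\|^2)^2}\right)$$

$$\mathsf{E}\,\bar{w}_i = \left[I - \mu\,\mathsf{E}\left(\frac{\bar{u}_i^*\bar{u}_i}{\epsilon + \|\bar{u}_i\|^2}\right)\right]\mathsf{E}\,\bar{w}_{i-1}$$


The moments that need to be evaluated are therefore 


$$\mathsf{E}\left(\frac{\|\bar{u}_i\|^2_{\bar\Sigma}}{(\epsilon + \|\bar{u}_i\|^2)^2}\right), \qquad \mathsf{E}\left(\frac{\bar{u}_i^*\bar{u}_i}{\epsilon + \|\bar{u}_i\|^2}\right), \qquad \mathsf{E}\left(\frac{\|\bar{u}_i\|^2_{\bar\Sigma}\,\bar{u}_i^*\bar{u}_i}{(\epsilon + \|\bar{u}_i\|^2)^2}\right)$$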


(a) Use the results of Prob. V.16 to conclude that $\bar\Sigma'$ will be diagonal when $\bar\Sigma$ is. 


(b) Verify that the expression for $\bar\Sigma'$ can be rewritten as $\bar\sigma' = \bar{F}\bar\sigma$, where the $M \times M$ matrix $\bar{F}$ 
is given by $\bar{F} = I - \mu A + \mu^2 B$ with 


= —*— zh 
A &à dE d e (|e = p4p’ 
eo, * [exa ES 
B B E GERIT. EMI. e ud |. p&tr E. | 
(e+ [EA EAE 


where the notation $\odot$ stands for the Hadamard (or elementwise) product of two vectors. 


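(c) Conclude also that the recursion for $\mathsf{E}\,\bar{w}_i$ can be written as $\mathsf{E}\,\bar{w}_i = (I - \mu\bar{P})\,\mathsf{E}\,\bar{w}_{i-1}$, and 
that the transient behavior of $\epsilon$-NLMS is characterized by the $M$-dimensional state-space 
equation $\mathcal{W}_i = \mathcal{F}\,\mathcal{W}_{i-1} + \mu^2\sigma_v^2\,\mathcal{Y}$, where 


$$\mathcal{F} = \begin{bmatrix} 0 & 1 & & & \\ 0 & 0 & 1 & & \\ \vdots & & & \ddots & \\ 0 & 0 & 0 & \cdots & 1 \\ -p_0 & -p_1 & -p_2 & \cdots & -p_{M-1} \end{bmatrix}_{M \times M}, \qquad \mathcal{W}_i = \begin{bmatrix} \mathsf{E}\,\|\bar{w}_i\|^2_{\bar\sigma} \\ \mathsf{E}\,\|\bar{w}_i\|^2_{\bar{F}\bar\sigma} \\ \vdots \\ \mathsf{E}\,\|\bar{w}_i\|^2_{\bar{F}^{M-1}\bar\sigma} \end{bmatrix}$$


with 

$$p(x) \triangleq \det(xI - \bar{F}) = x^M + \sum_{k=0}^{M-1} p_k x^k$$


denoting the characteristic polynomial of the matrix $\bar{F}$, and where $\bar\sigma$ is an $M$-dimensional 
vector and the $k$-th entry of $\mathcal{Y}$ is 


$$[\mathcal{Y}]_k = \mathsf{E}\left(\frac{\|\bar{u}_i\|^2_{\bar{F}^k\bar\sigma}}{(\epsilon + \|\bar{u}_i\|^2)^2}\right), \qquad k = 0, \ldots, M-1$$


Remark. These definitions of $\{\mathcal{W}_i, \mathcal{Y}\}$ are in terms of any $\bar\sigma$ of interest, e.g., most commonly, $\bar\sigma = q$ or 
$\bar\sigma = \lambda$, where $\lambda = \mathrm{diag}\{\Lambda\}$. 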

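(d) Verify that the mean-square deviation and the excess mean-square error of the algorithm are 
given by 


$$\mathsf{MSD} = \mu^2\sigma_v^2\,\mathsf{E}\left(\frac{\|\bar{u}_i\|^2_{(I-\bar{F})^{-1}q}}{(\epsilon + \|\bar{u}_i\|^2)^2}\right), \qquad \mathsf{EMSE} = \mu^2\sigma_v^2\,\mathsf{E}\left(\frac{\|\bar{u}_i\|^2_{(I-\bar{F})^{-1}\lambda}}{(\epsilon + \|\bar{u}_i\|^2)^2}\right)$$


(e) Verify that the expressions of part (d) can be rewritten as 


$$\mathsf{MSD} = \mu^2\sigma_v^2\,\mathrm{Tr}(S\,\bar\Sigma_{\rm msd}) \quad \text{and} \quad \mathsf{EMSE} = \mu^2\sigma_v^2\,\mathrm{Tr}(S\,\bar\Sigma_{\rm emse})$$


where $S = \mathsf{E}\left(\bar{u}_i^*\bar{u}_i/(\epsilon + \|\bar{u}_i\|^2)^2\right)$ and $\{\bar\Sigma_{\rm msd}, \bar\Sigma_{\rm emse}\}$ are the weighting matrices that 
correspond to the vectors $\bar\sigma_{\rm msd} = (I - \bar{F})^{-1}q$ and $\bar\sigma_{\rm emse} = (I - \bar{F})^{-1}\lambda$. That is, $\bar\Sigma_{\rm msd} = 
\mathrm{vec}^{-1}(\bar\sigma_{\rm msd})$ and $\bar\Sigma_{\rm emse} = \mathrm{vec}^{-1}(\bar\sigma_{\rm emse})$. 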


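Problem V.18 (Performance of $\epsilon$-NLMS with small step-size) In App. 17.A we argued that 
an $\epsilon$-NLMS recursion can be reduced to an equivalent LMS recursion by using the transformations 
(17.10), in which case the $\epsilon$-NLMS update becomes 


$$w_i = w_{i-1} + \mu\,\check{u}_i^*\,\check{e}(i), \quad \text{where} \quad \check{e}(i) = \check{d}(i) - \check{u}_i w_{i-1}$$


Moreover, the variables $\{d(i), u_i, \check{d}(i), \check{u}_i\}$ satisfy models (15.16) and (17.11). The covariance matrix of $\check{u}_i$ 
is denoted by $R_{\check{u}} = \mathsf{E}\,\check{u}_i^*\check{u}_i = \mathsf{E}\left(u_i^* u_i/(\epsilon + \|u_i\|^2)\right)$. Assume the step-size is sufficiently small 
and consider again the reasoning in Prob. V.14. 
(a) Using the above LMS update, argue that a simplified variance relation in this case is given by 
(see Prob. V.19 for another approximation): 


$$\mathsf{E}\,\|\tilde{w}_i\|^2_{\Sigma} = \mathsf{E}\,\|\tilde{w}_{i-1}\|^2_{\Sigma'} + \mu^2\sigma_{\check{v}}^2\,\mathsf{E}\,\|\check{u}_i\|^2_{\Sigma}$$
$$\Sigma' = \Sigma - \mu\,\Sigma\,\mathsf{E}\,[\check{u}_i^*\check{u}_i] - \mu\,\mathsf{E}\,[\check{u}_i^*\check{u}_i]\,\Sigma$$


(b) Introduce the eigen-decomposition $R_{\check{u}} = \check{U}\check\Lambda\check{U}^*$, where $\check\Lambda$ is diagonal with the eigenvalues 
of $R_{\check{u}}$, and $\check{U}$ is unitary. Introduce further the transformed quantities $\bar{w}_i = \check{U}^*\tilde{w}_i$, $\bar{u}_i = \check{u}_i\check{U}$, 
and $\bar\Sigma = \check{U}^*\Sigma\check{U}$. Verify that the above variance relation becomes 


$$\mathsf{E}\,\|\bar{w}_i\|^2_{\bar\Sigma} = \mathsf{E}\,\|\bar{w}_{i-1}\|^2_{\bar\Sigma'} + \mu^2\sigma_{\check{v}}^2\,\mathsf{E}\,\|\bar{u}_i\|^2_{\bar\Sigma}$$
$$\bar\Sigma' = \bar\Sigma - \mu\,\bar\Sigma\check\Lambda - \mu\,\check\Lambda\bar\Sigma$$


Conclude that $\bar\Sigma'$ is diagonal if $\bar\Sigma$ is. 


(c) Let $\bar\sigma = \mathrm{diag}\{\bar\Sigma\}$ and $\check\lambda = \mathrm{diag}\{\check\Lambda\}$. Verify that the expression for $\bar\Sigma'$ reduces to $\bar\sigma' = \bar{F}\bar\sigma$, 
where $\bar{F} = I - 2\mu\check\Lambda$. 


(d) Follow the arguments that led to Thm. 23.3 to verify that $\mathsf{MSD} = \mu\sigma_{\check{v}}^2 M/2$, where $\sigma_{\check{v}}^2 = 
\mathsf{E}\,|\check{v}(i)|^2 = \sigma_v^2\,\mathsf{E}\left(1/(\epsilon + \|u_i\|^2)\right)$. 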


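Problem V.19 (Another approximation for $\epsilon$-NLMS) Consider the same setting of Prob. V.18 
and continue to assume a sufficiently small step-size. 


(a) Use (25.3) to justify the following simplified variance relation: 


$$\mathsf{E}\,\|\tilde{w}_i\|^2_{\Sigma} = \mathsf{E}\,\|\tilde{w}_{i-1}\|^2_{\Sigma'} + \mu^2\sigma_v^2\,\mathsf{E}\left(\frac{\|\check{u}_i\|^2_{\Sigma}}{\epsilon + \|u_i\|^2}\right)$$
$$\Sigma' = \Sigma - \mu\,\Sigma\,\mathsf{E}\,[\check{u}_i^*\check{u}_i] - \mu\,\mathsf{E}\,[\check{u}_i^*\check{u}_i]\,\Sigma$$


(b) Verify that the above variance relation can be rewritten as 


$$\mathsf{E}\,\|\bar{w}_i\|^2_{\bar\Sigma} = \mathsf{E}\,\|\bar{w}_{i-1}\|^2_{\bar\Sigma'} + \mu^2\sigma_v^2\,\mathsf{E}\left(\frac{\|\bar{u}_i\|^2_{\bar\Sigma}}{\epsilon + \|u_i\|^2}\right)$$
$$\bar\Sigma' = \bar\Sigma - \mu\,\bar\Sigma\check\Lambda - \mu\,\check\Lambda\bar\Sigma$$

Conclude that $\bar\Sigma'$ is diagonal if $\bar\Sigma$ is. 


(c) Define the covariance matrix 


$$R_{\hat{u}} \triangleq \mathsf{E}\left(\frac{\check{u}_i^*\check{u}_i}{\epsilon + \|u_i\|^2}\right) = \mathsf{E}\left(\frac{u_i^* u_i}{(\epsilon + \|u_i\|^2)^2}\right)$$

and justify the following expressions for the MSD and EMSE: 

$$\mathsf{MSD} = \mu\sigma_v^2\,\mathrm{Tr}(R_{\check{u}}^{-1}R_{\hat{u}})/2, \qquad \mathsf{EMSE} = \mu\sigma_v^2\,\mathrm{Tr}(R_{\hat{u}}\Sigma_{\rm emse})$$


where $\Sigma_{\rm emse}$ is the unique solution to the Lyapunov equation $R_{\check{u}}\Sigma_{\rm emse} + \Sigma_{\rm emse}R_{\check{u}} = R_u$. 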


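Problem V.20 (Alternative independence analysis of NLMS) There is an alternative 
approximation method to simplify the analysis of NLMS. Unlike the results we discussed in Secs. 17.1 and 
25.1, this method requires $\epsilon = 0$. Thus consider the NLMS recursion 


$$w_i = w_{i-1} + \mu\,\frac{u_i^*}{\|u_i\|^2}\,e(i), \qquad e(i) = d(i) - u_i w_{i-1}$$

where the data $\{d(i), u_i\}$ are still assumed to satisfy model (22.1) and condition (22.23). In 
addition, there is one more assumption on the distribution of the regressors. Let $R_u = U\Lambda U^*$ denote the 
eigen-decomposition of $R_u$. The columns of $U$ are the eigenvectors of $R_u$ and they will be denoted 
by $\{q_j\}$; they all have unit norm. Assume that the regressors are modeled as $u_i^* = r(i)s(i)h_i$, 
where the random variables $\{r(i), s(i), h_i\}$ are independent, $r(i)$ has the same probability 
distribution as $\|u_i\|$, $s(i) = \pm 1$ with probability $1/2$, and $\mathrm{Prob}(h_i = q_j) = \lambda_j/\mathrm{Tr}(R_u)$. The idea of 
this construction is to assume a regression distribution that simplifies the analysis; it is chosen such 
that the first and second-order moments of the $u_i$ so generated coincide with those of the original 
$u_i$. Since for small step-sizes, the filter performance depends on these lower-order moments, the 
approximation tends to be reasonable. 


(a) Show that 

$$\mathsf{E}\,h_i h_i^* = R_u/\mathrm{Tr}(R_u), \qquad \mathsf{E}\,h_i h_i^* h_i h_i^* = R_u/\mathrm{Tr}(R_u)$$

and 

$$\mathsf{E}\,h_i^* R_u^{-1} h_i = M/\mathrm{Tr}(R_u), \qquad \mathsf{E}\,h_i h_i^* R_u^{-1} h_i h_i^* = I/\mathrm{Tr}(R_u)$$

(b) Refer to the variance relation (25.3), with $\epsilon = 0$. Using the assumed model for $u_i$, verify that 
these relations reduce to 


$$\mathsf{E}\,\|\tilde{w}_i\|^2_{\Sigma} = \mathsf{E}\,\|\tilde{w}_{i-1}\|^2_{\Sigma'} + \mu^2\sigma_v^2\,\mathsf{E}\,(h_i^*\Sigma h_i)\,\mathsf{E}\left(\frac{1}{\|u_i\|^2}\right)$$
$$\Sigma' = \Sigma - \mu\,\Sigma\,(\mathsf{E}\,h_i h_i^*) - \mu\,(\mathsf{E}\,h_i h_i^*)\,\Sigma + \mu^2\,\mathsf{E}\,h_i h_i^*\Sigma h_i h_i^*$$

(c) Assume first $\Sigma = I$. Show that $\Sigma' = I - \mu(2-\mu)\,R_u/\mathrm{Tr}(R_u)$. Assume now $\Sigma = \frac{\mathrm{Tr}(R_u)}{\mu(2-\mu)}R_u^{-1}$. 
Show that 

$$\Sigma' = \Sigma - I$$

(i.e., verify that $\Sigma - \Sigma' = I$). 
(d) Take the limit of both sides of the recursion for $\mathsf{E}\,\|\tilde{w}_i\|^2_{\Sigma}$ to conclude that 


$$\mathsf{E}\,\|\tilde{w}_\infty\|^2_{\Sigma - \Sigma'} = \mu^2\sigma_v^2\,\mathsf{E}\,(h_i^*\Sigma h_i)\,\mathsf{E}\left(\frac{1}{\|u_i\|^2}\right)$$


Use the choice $\Sigma = \frac{\mathrm{Tr}(R_u)}{\mu(2-\mu)}R_u^{-1}$ to conclude that the $\mathsf{MSD} = \mathsf{E}\,\|\tilde{w}_\infty\|^2$ is given by 


$$\mathsf{MSD} = \frac{\mu\sigma_v^2 M}{2-\mu}\,\mathsf{E}\left(\frac{1}{\|u_i\|^2}\right)$$


(e) Choose now $\Sigma = \frac{\mathrm{Tr}(R_u)}{\mu(2-\mu)}I$, and verify that $\Sigma - \Sigma' = R_u$. With this choice for $\Sigma$, conclude 
that 

$$\mathsf{E}\,\|\tilde{w}_\infty\|^2_{R_u} = \mu^2\sigma_v^2\,\frac{\mathrm{Tr}(R_u)}{\mu(2-\mu)}\,\mathsf{E}\left(\frac{1}{\|u_i\|^2}\right)$$


so that the $\mathsf{EMSE} = \mathsf{E}\,\|\tilde{w}_\infty\|^2_{R_u}$ is given by 


$$\mathsf{EMSE} = \frac{\mu\sigma_v^2\,\mathrm{Tr}(R_u)}{2-\mu}\,\mathsf{E}\left(\frac{1}{\|u_i\|^2}\right)$$


Compare with Lemma 17.1. 


Remark. The model used in this problem for $u_i$ has been suggested by Slock (1993), and the resulting 
performance results agree with those derived by him. 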
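
The moment identities of part (a) and the weighting-matrix choices of parts (c) and (e) in Prob. V.20 can be verified numerically, since the expectations over $h_i$ are finite weighted sums. The sketch below uses real data (so that $h^* = h^{\mathsf T}$) and a randomly generated covariance matrix; both are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
M = 4
X = rng.standard_normal((M, M))
Ru = X @ X.T + np.eye(M)       # hypothetical positive-definite covariance matrix
lam, U = np.linalg.eigh(Ru)    # the columns of U play the role of the q_j
tr = np.trace(Ru)
prob = lam / tr                # Prob(h_i = q_j) = lambda_j / Tr(R_u)

def expect(fn):
    """Expectation over h_i of fn(h), with h a column vector."""
    return sum(p * fn(U[:, [j]]) for j, p in enumerate(prob))

Ru_inv = np.linalg.inv(Ru)
E_hh = expect(lambda h: h @ h.T)
E_hRh = expect(lambda h: (h.T @ Ru_inv @ h).item())
E_hhRhh = expect(lambda h: h @ h.T @ Ru_inv @ h @ h.T)

def sigma_prime(Sigma, mu):
    """Weighting-matrix recursion of part (b) under the assumed regressor model."""
    E_hhShh = expect(lambda h: h @ h.T @ Sigma @ h @ h.T)
    return Sigma - mu * Sigma @ E_hh - mu * E_hh @ Sigma + mu**2 * E_hhShh

mu = 0.7
Sigma = tr * Ru_inv / (mu * (2 - mu))   # choice used for the MSD in part (d)
print(np.allclose(Sigma - sigma_prime(Sigma, mu), np.eye(M)))
```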


Problem V.21 (LMS with white input) Consider the LMS recursion (23.1) and assume the data 
$\{d(i), u_i\}$ satisfy (22.1) and the independence assumption (22.23). Assume further that the 
individual entries of the regressor $u_i$ are zero-mean i.i.d. with variance $\sigma_u^2$ and fourth moment $\xi_u^4$. In this 
way, the covariance matrix is diagonal, $R_u = \sigma_u^2 I$. 

(a) Assume $\Sigma$ is diagonal and let $\sigma = \mathrm{diag}\{\Sigma\}$. Verify that the expectation $\mathsf{E}\,[\|u_i\|^2_{\Sigma}\,u_i^* u_i]$ is 
diagonal with entries given by $B\sigma$, where $B = \sigma_u^4\,qq^{\mathsf T} + (\xi_u^4 - \sigma_u^4)I$ and $q$ is the column 
vector with unit entries. 

(b) Refer to the variance relation of Thm. 22.4, specialized to LMS, and verify that $\Sigma'$ will be 
diagonal if $\Sigma$ is. In particular, set $\Sigma = I$ and define $\kappa^4 = \xi_u^4 + (M-1)\sigma_u^4$. Verify that this 
leads to $\Sigma' = (1 - 2\mu\sigma_u^2 + \mu^2\kappa^4)I$. 

(c) Conclude that the transient behavior of LMS is now described by the one-dimensional 
difference equation $\mathsf{E}\,\|\tilde{w}_i\|^2 = (1 - 2\mu\sigma_u^2 + \mu^2\kappa^4)\,\mathsf{E}\,\|\tilde{w}_{i-1}\|^2 + \mu^2\sigma_v^2\sigma_u^2 M$. 

(d) Show that the MSD and EMSE of LMS under these conditions are given by 


$$\mathsf{MSD} = \frac{\mu\sigma_v^2\sigma_u^2 M}{2\sigma_u^2 - \mu\kappa^4}, \qquad \mathsf{EMSE} = \frac{\mu\sigma_v^2\sigma_u^4 M}{2\sigma_u^2 - \mu\kappa^4}$$


(e) Assume the individual entries of $u_i$ have a Gaussian distribution, so that $\xi_u^4 = 2\sigma_u^4$ (for 
complex-valued random variables; $\xi_u^4 = 3\sigma_u^4$ for real-valued random variables). Verify that 
in this case 


$$\mathsf{E}\,\|\tilde{w}_i\|^2 = \left(1 - 2\mu\sigma_u^2 + \mu^2\sigma_u^4(M+1)\right)\mathsf{E}\,\|\tilde{w}_{i-1}\|^2 + \mu^2\sigma_v^2\sigma_u^2 M$$


and 

$$\mathsf{MSD} = \frac{\mu\sigma_v^2 M}{2 - \mu\sigma_u^2(M+1)}, \qquad \mathsf{EMSE} = \frac{\mu\sigma_v^2\sigma_u^2 M}{2 - \mu\sigma_u^2(M+1)}$$


(f) Show that the expressions of part (e) agree with what you would obtain from the result of 
Thm. 23.3. 
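
The scalar recursion of part (c) of Prob. V.21 can be iterated directly and its limit compared with the closed form obtained from its fixed point. The sketch below is illustrative only: the parameter values and the initial condition are assumed, and real-valued Gaussian entries are used so that $\xi_u^4 = 3\sigma_u^4$.

```python
import numpy as np

# Hypothetical parameter values; real-valued Gaussian regressor entries are assumed,
# so the fourth moment is xi^4 = 3 * sigma_u^4.
M, mu = 8, 0.01
sigma_u2, sigma_v2 = 1.0, 1e-3
xi4 = 3.0 * sigma_u2**2
kappa4 = xi4 + (M - 1) * sigma_u2**2

a = 1.0 - 2.0 * mu * sigma_u2 + mu**2 * kappa4   # contraction factor of part (c)
b = mu**2 * sigma_v2 * sigma_u2 * M              # driving term of part (c)

msd = np.empty(5000)
x = 1.0  # assumed initial value of E||w~_{-1}||^2
for i in range(msd.size):
    x = a * x + b
    msd[i] = x

# Fixed point b / (1 - a), written in the form of part (d).
msd_closed_form = mu * sigma_v2 * sigma_u2 * M / (2.0 * sigma_u2 - mu * kappa4)
print(msd[-1], msd_closed_form)
```

Plotting `msd` on a logarithmic scale gives the theoretical mean-square-deviation learning curve, against which an ensemble-average simulation could be compared.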


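Problem V.22 (Matrix data-nonlinearities) Consider adaptive filter updates of the form 


$$w_i = w_{i-1} + \mu\,H[u_i]\,u_i^*\,e(i), \qquad i \geq 0$$


$$e(i) = d(i) - u_i w_{i-1}$$


where $H[\cdot]$ is some positive-definite Hermitian matrix-valued (as opposed to scalar-valued) function 
of $u_i$. Assume further that the data $\{d(i), u_i\}$ satisfy the modeling assumptions (22.1). 

For any Hermitian positive-definite matrix $\Sigma$, define the a priori and a posteriori estimation errors 
$e_a^{H\Sigma}(i) = u_i H[u_i]\Sigma\tilde{w}_{i-1}$ and $e_p^{H\Sigma}(i) = u_i H[u_i]\Sigma\tilde{w}_i$, where $\tilde{w}_i = w^o - w_i$. 


(a) Repeat the derivation of Sec. 22.3 to establish the following energy conservation relation: 


$$\|u_i\|^2_{H\Sigma H}\cdot\|\tilde{w}_i\|^2_{\Sigma} + |e_a^{H\Sigma}(i)|^2 = \|u_i\|^2_{H\Sigma H}\cdot\|\tilde{w}_{i-1}\|^2_{\Sigma} + |e_p^{H\Sigma}(i)|^2$$


where $\|u_i\|^2_{H\Sigma H} = u_i\left(H[u_i]\,\Sigma\,H[u_i]\right)u_i^*$. 


(b) Assume further that the regressors satisfy the independence assumption (22.23). Repeat the 
derivation of Sec. 22.4 in order to extend the result of Thm. 22.4 to the present context, i.e., 
establish the validity of the following variance relation: 


$$\mathsf{E}\,\|\tilde{w}_i\|^2_{\Sigma} = \mathsf{E}\,\|\tilde{w}_{i-1}\|^2_{\Sigma'} + \mu^2\sigma_v^2\,\mathsf{E}\,\|u_i\|^2_{H\Sigma H}$$


$$\Sigma' = \Sigma - \mu\,\Sigma\,\mathsf{E}\,(H[u_i]u_i^* u_i) - \mu\,\mathsf{E}\,(u_i^* u_i H[u_i])\,\Sigma + \mu^2\,\mathsf{E}\left(\|u_i\|^2_{H\Sigma H}\,u_i^* u_i\right)$$


(c) Verify also that $\tilde{w}_i = (I - \mu H[u_i]u_i^* u_i)\,\tilde{w}_{i-1} - \mu H[u_i]u_i^*\,v(i)$. 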


Problem V.23 (Filters with data nonlinearities) Verify that the algorithms listed below can be 
regarded as special cases of the filter update with matrix data nonlinearities of Prob. V.22 for different 
choices of $H[\cdot]$. In each case, identify the matrix $H$. 


(a) LMS with multiple step-sizes: $w_i = w_{i-1} + \mathrm{diag}\{\mu_1, \mu_2, \ldots, \mu_M\}\,u_i^*\,[d(i) - u_i w_{i-1}]$. 
(b) NLMS with individual power estimates: 
$$w_i = w_{i-1} + \mu\,\mathrm{diag}\{1/p_1(i), 1/p_2(i), \ldots, 1/p_M(i)\}\,u_i^*\,[d(i) - u_i w_{i-1}]$$
where each $p_k(i)$ is updated as follows: 
$$p_k(i) = \beta\,p_k(i-1) + (1-\beta)\,|u_i(k)|^2, \qquad 0 < \beta < 1, \qquad p_k(-1) = \epsilon$$
and $u_i(k)$ denotes the $k$-th entry of $u_i$. 
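
Both algorithms of Prob. V.23 can be run through a single routine that accepts the matrix nonlinearity as a function, which makes the common structure of Prob. V.22 explicit. The following is a minimal sketch for real-valued data; the names `run_filter` and `PowerNormalizer`, the data model, and all numerical values are hypothetical choices made for illustration.

```python
import numpy as np

def run_filter(U, d, mu, H_fn, M):
    """Run w_i = w_{i-1} + mu * H[u_i] u_i^T e(i) on real data, starting from w_{-1} = 0."""
    w = np.zeros(M)
    for u, di in zip(U, d):
        e = di - u @ w
        w = w + mu * H_fn(u) @ u * e
    return w

class PowerNormalizer:
    """H[u_i] = diag{1/p_k(i)} with p_k(i) = beta*p_k(i-1) + (1-beta)*|u_i(k)|^2."""
    def __init__(self, M, beta, eps):
        self.p = np.full(M, eps)  # p_k(-1) = eps
        self.beta = beta

    def __call__(self, u):
        self.p = self.beta * self.p + (1 - self.beta) * u**2
        return np.diag(1.0 / self.p)

rng = np.random.default_rng(4)
M, N = 3, 2000
w_true = np.array([0.5, -1.0, 2.0])   # hypothetical unknown vector w^o
U = rng.standard_normal((N, M))
d = U @ w_true + 0.01 * rng.standard_normal(N)

# (a) LMS with multiple step-sizes: H = diag{mu_1, ..., mu_M}, with the overall mu set to 1.
w_a = run_filter(U, d, 1.0, lambda u: np.diag([0.02, 0.01, 0.03]), M)
# (b) NLMS with individual power estimates.
w_b = run_filter(U, d, 0.05, PowerNormalizer(M, beta=0.9, eps=1e-3), M)
print(w_a, w_b)
```

Passing `H_fn = lambda u: np.eye(M)` recovers plain LMS, which is a convenient baseline when comparing convergence behavior.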


Problem V.24 (Stability of data-normalized filters) Refer to the discussion on data-normalized 
filters in Sec. 25.2 and consider the variance relation (25.14). Let 


$$P \triangleq \mathsf{E}\left(\frac{u_i^* u_i}{g[u_i]}\right), \qquad S \triangleq \mathsf{E}\left(\frac{u_i^* u_i}{g^2[u_i]}\right), \qquad X \triangleq \mathsf{E}\left(\frac{\|u_i\|^2}{g^2[u_i]}\,u_i^* u_i\right)$$


and choose $\Sigma = I$. 
(a) Verify that the relations (25.14) can be written as 
$$\mathsf{E}\,\|\tilde{w}_i\|^2 = \mathsf{E}\,\|\tilde{w}_{i-1}\|^2_{\Sigma'} + \mu^2\sigma_v^2\,\mathrm{Tr}(S), \qquad \Sigma' = I - 2\mu P + \mu^2 X, \qquad \mathsf{E}\,\tilde{w}_i = (I - \mu P)\,\mathsf{E}\,\tilde{w}_{i-1}$$


(b) Assume the moments of $u_i$ and the function $g[\cdot]$ are such that $X = cP$ for some constant 
$c > 0$. This situation arises, for example, while studying the sign-regressor LMS algorithm 
(see Prob. V.25). Show that $\mathsf{E}\,\|\tilde{w}_i\|^2$ converges if, and only if, $\mu < 2/c$. 


Remark. The scenario studied in this problem arises in some studies of LMS itself (see, e.g., Macchi (1995, 
pp. 161-163)). See also App. 9.C of Sayed (2003). 


Problem V.25 (Sign-regressor LMS algorithm) Assume all data are real-valued, and consider 
an adaptive filter update of the form: 


where sign[u;] is a column vector with the signs of the entries of u;. 


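(a) Argue that this algorithm can be regarded as a special case of the general update form of 
Prob. V.22 for some matrix data nonlinearity that is implicitly defined by the identity 
$\mathrm{sign}[u_i]^{\mathsf T} = H[u_i]\,u_i^{\mathsf T}$. 


(b) Verify that the variance relation of part (b) of Prob. V.22 becomes 


$$\mathsf{E}\,\|\tilde{w}_i\|^2_{\Sigma} = \mathsf{E}\,\|\tilde{w}_{i-1}\|^2_{\Sigma'} + \mu^2\sigma_v^2\,\mathsf{E}\left[\|\mathrm{sign}[u_i]\|^2_{\Sigma}\right]$$
$$\Sigma' = \Sigma - \mu\,\Sigma\,\mathsf{E}\left(\mathrm{sign}[u_i]^{\mathsf T}u_i\right) - \mu\,\mathsf{E}\left(u_i^{\mathsf T}\mathrm{sign}[u_i]\right)\Sigma + \mu^2\,\mathsf{E}\left[\|\mathrm{sign}[u_i]\|^2_{\Sigma}\,u_i^{\mathsf T}u_i\right]$$


(c) Assume that the individual entries of $u_i$ have variance $\sigma_u^2$, and that $u_i$ has a Gaussian 
distribution. Use Price's theorem (cf. (18.6)) to verify that 


$$\mathsf{E}\left[\mathrm{sign}[u_i]^{\mathsf T}u_i\right] = \sqrt{\frac{2}{\pi\sigma_u^2}}\,R_u$$


Conclude that the weighting matrix of part (b) reduces to 


$$\Sigma' = \Sigma - \mu\sqrt{\frac{2}{\pi\sigma_u^2}}\,\Sigma R_u - \mu\sqrt{\frac{2}{\pi\sigma_u^2}}\,R_u\Sigma + \mu^2\,\mathsf{E}\left[\|\mathrm{sign}[u_i]\|^2_{\Sigma}\,u_i^{\mathsf T}u_i\right]$$


Choose $\Sigma = I$ and verify that $\Sigma' = I + \mu\left(\mu M - 2\sqrt{\dfrac{2}{\pi\sigma_u^2}}\right)R_u$. 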
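(d) Use the result of Prob. V.24 to show that $E\|\tilde{w}_i\|^2$ converges if, and only if,

$$\mu < \frac{2}{M}\sqrt{\frac{2}{\pi\sigma_u^2}}$$

Conclude further that

$$\mathrm{EMSE} = \frac{\mu\sigma_v^2 M}{2\sqrt{2/(\pi\sigma_u^2)} - \mu M}$$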

Remark. The conditions on p for mean-square stability and the expression for the filter EMSE coincide with 
those derived by Eweda (1990a). It should be noted though that replacement of the regressor by its sign limits the 
range of search directions that are followed by the adaptive algorithm and, therefore, performance degradation in 
terms of convergence speed (and even possibly divergence) can occur relative to a plain LMS implementation. 
Actually, it is possible to determine examples of input sequences that can cause sign-regressor LMS to diverge 
while LMS still converges. For more discussion on these issues, see Sethares et al. (1988) and Sethares and 
Johnson (1989). 
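A minimal simulation sketch of the sign-regressor update for real-valued Gaussian regressors is given below. The step-size bound used in the code is obtained by combining $X = cP$ from Prob. V.24 with $P = \sqrt{2/(\pi\sigma_u^2)}\,R_u$ and $X = M R_u$, which gives $\mu < (2/M)\sqrt{2/(\pi\sigma_u^2)}$; the data sizes, the noise level, and the fraction of the bound used are hypothetical choices.

```python
import numpy as np

def sign_regressor_lms(U, d, mu):
    """Sign-regressor LMS for real data; U[i] is the row regressor u_i."""
    N, M = U.shape
    w = np.zeros(M)
    for i in range(N):
        e = d[i] - U[i] @ w                 # a priori output error
        w = w + mu * np.sign(U[i]) * e      # regressor replaced by its sign vector
    return w

rng = np.random.default_rng(1)
M, N, sigma_u = 5, 5000, 1.0
w_true = rng.standard_normal(M)
U = sigma_u * rng.standard_normal((N, M))   # white Gaussian regressors, R_u = sigma_u^2 I
d = U @ w_true + 0.01 * rng.standard_normal(N)

# Mean-square stability bound (assumed form, see the lead-in)
mu_max = (2.0 / M) * np.sqrt(2.0 / (np.pi * sigma_u**2))
w_hat = sign_regressor_lms(U, d, mu=0.05 * mu_max)
```

Running the same loop with a step-size above `mu_max` is a simple way to observe the loss of mean-square stability that the remark alludes to.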


Problem V.26 (Price's theorem for complex variables) Let $a$ and $b$ be scalar complex-valued
zero-mean jointly Gaussian and circular random variables. Let $\rho = E\,ab$. The complex form of
Price's theorem states that for any function $f$ of $(a, b)$ (for which the required derivatives and
integrals exist), it holds that

$$\frac{\partial\, E f(a,b)}{\partial\rho} = E\left(\frac{\partial^2 f(a,b)}{\partial a\,\partial b}\right)$$

where differentiation with respect to a complex variable is defined as in Chapter C.

(a) Assume the function $f(a, b)$ has the form $f(a, b) = a\,g(b)$. Verify from Price's theorem that
$E\,a\,g(b) = (E\,ab) \cdot E\,(dg(b)/db)$.

(b) Show further that $E\,b^* g(b) = E|b|^2 \cdot E\,dg(b)/db$ and conclude that the following relation
also holds:

$$E\,a\,g(b) = \frac{E\,ab}{E|b|^2} \cdot E\,b^* g(b)$$

Remark. For more details on Price's theorem for complex data, see McGee (1969) and van den Bos (1996).


Problem V.27 (Leaky variance relation) In this problem we extend the energy-conservation 
and variance relations of Thms. 22.1, 22.3 and 22.4 to leaky adaptive updates of the form 


$$w_i = (1 - \mu\alpha)\,w_{i-1} + \mu\,\frac{u_i^*}{g[u_i]}\,e(i), \quad i \geq 0$$

$$e(i) = d(i) - u_i w_{i-1}$$

where $\alpha$ is a positive scalar and $g[\cdot]$ is some positive-valued function of $u_i$. For example, the choice
$g[u_i] = 1$ results in the leaky-LMS filter. The data $\{d(i), u_i\}$ are still assumed to satisfy model
(22.1).


(a) Verify that the weight-error vector, $\tilde{w}_i = w^o - w_i$, satisfies the recursion

$$\tilde{w}_i = (1 - \alpha\mu)\,\tilde{w}_{i-1} - \mu\,\frac{u_i^*}{g[u_i]}\,e(i) + \alpha\mu\, w^o$$


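(b) Equating the weighted norms of both sides of the above equality, for some arbitrary Hermitian
positive-definite weighting matrix $\Sigma$, and writing $e(i) = e_a(i) + v(i)$ with $e_a(i) = u_i\tilde{w}_{i-1}$, verify that

$$\|\tilde{w}_i\|^2_\Sigma = \left\|(1-\alpha\mu)\tilde{w}_{i-1} - \mu\frac{u_i^*}{g[u_i]}e_a(i) + \alpha\mu w^o\right\|^2_\Sigma + \mu^2\frac{\|u_i\|^2_\Sigma}{g^2[u_i]}\,|v(i)|^2$$
$$\qquad\qquad - 2\mu\,\mathrm{Re}\left\{\frac{v^*(i)\,u_i}{g[u_i]}\,\Sigma\left[(1-\alpha\mu)\tilde{w}_{i-1} - \mu\frac{u_i^*}{g[u_i]}e_a(i) + \alpha\mu w^o\right]\right\}$$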


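(c) Use the condition on $v(i)$ from the data model (22.1), and take expectations of both sides of
the equality of part (b), to establish the variance relation:

$$E\|\tilde{w}_i\|^2_\Sigma = E\|\tilde{w}_{i-1}\|^2_{\Sigma'} + \alpha^2\mu^2\|w^o\|^2_\Sigma + \mu^2\sigma_v^2\, E\left(\frac{\|u_i\|^2_\Sigma}{g^2[u_i]}\right) + 2\alpha\mu\,\mathrm{Re}\left\{w^{o*}\Sigma\, E\left[(I - \mu U_i)\tilde{w}_{i-1}\right]\right\}$$

where

$$\Sigma' = (1-\alpha\mu)^2\Sigma - \mu(1-\alpha\mu)\,\Sigma\,\frac{u_i^*u_i}{g[u_i]} - \mu(1-\alpha\mu)\,\frac{u_i^*u_i}{g[u_i]}\,\Sigma + \mu^2\,\frac{\|u_i\|^2_\Sigma}{g^2[u_i]}\,u_i^*u_i$$

and

$$U_i \triangleq \alpha I + \frac{u_i^*u_i}{g[u_i]}$$

Verify further that the recursion for the weighting matrix can be rewritten more compactly as

$$\Sigma' = \Sigma - \mu\Sigma U_i - \mu U_i\Sigma + \mu^2 U_i\Sigma U_i$$

The above variance relation is the extension of Thm. 22.3 to leaky data-normalized filters.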


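(d) Now assume that the regressors $\{u_i\}$ satisfy the independence assumption (22.23). Conclude
that the variance relation of part (c) reduces to the following:

$$E\|\tilde{w}_i\|^2_\Sigma = E\|\tilde{w}_{i-1}\|^2_{\Sigma'} + \alpha^2\mu^2\|w^o\|^2_\Sigma + \mu^2\sigma_v^2\,E\left(\frac{\|u_i\|^2_\Sigma}{g^2[u_i]}\right) + 2\alpha\mu\,\mathrm{Re}\left\{w^{o*}\Sigma J\, E\,\tilde{w}_{i-1}\right\}$$
$$E\,\tilde{w}_i = J\,E\,\tilde{w}_{i-1} + \alpha\mu\, w^o$$
$$\Sigma' = \Sigma - \mu(E\,U_i)\Sigma - \mu\Sigma(E\,U_i) + \mu^2 E\,(U_i\Sigma U_i)$$

where $J = I - \mu E\,(U_i)$ and $\Sigma'$ now denotes the expectation of the random weighting matrix of part (c).
These recursions extend the statements of Thms. 22.4 and 22.5 to leaky data-normalized filters.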


Remark. The results of Probs. V.27-V.32 are based on the work by Sayed and Al-Naffouri (2001).


Problem V.28 (Transient performance of leaky filters) Consider the setting of Prob. V.27.
We now follow the discussion of Sec. 24.1 in order to derive a state-space model that characterizes
the transient behavior of leaky adaptive filters, in the general case of regressors that are not
necessarily Gaussian. We do so by appealing to the $\mathrm{vec}\{\cdot\}$ notation introduced in that section. Thus
let $\sigma = \mathrm{vec}\{\Sigma\}$, i.e., $\sigma$ is an $M^2 \times 1$ column vector that is obtained by stacking the columns of $\Sigma$
on top of each other.


(a) Verify that the expression for $\Sigma'$ from part (d) of Prob. V.27 reduces to the linear relation

$$\sigma' = F\sigma, \qquad F \triangleq E\left((I_M - \mu U_i)^T \otimes (I_M - \mu U_i)\right)$$

with an $M^2 \times M^2$ matrix $F$.


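(b) Show that the recursion for the mean weight-error vector, $E\,\tilde{w}_i$, is stable if, and only if, the
step-size $\mu$ satisfies

$$0 < \mu < \frac{2}{\alpha + \lambda_{\max}\left(E\left[\dfrac{u_i^*u_i}{g[u_i]}\right]\right)}$$

Assuming $w_{-1} = 0$ and, therefore, $E\,\tilde{w}_{-1} = w^o$, iterate the recursion for $E\,\tilde{w}_i$ and show
that $E\,\tilde{w}_i = C_i w^o$ for $i \geq 0$, where $C_i = J^{i+1} + \alpha\mu(I + J + \ldots + J^{i})$. Conclude that

$$2\alpha\mu\,\mathrm{Re}\left\{w^{o*}\Sigma J\,E\,\tilde{w}_{i-1}\right\} = \alpha\mu\,\|w^o\|^2_{S_{i-1}\sigma}$$

with the matrix $S_i$ defined in part (c).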


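(c) Show that the transient behavior of leaky data-normalized adaptive filters is characterized by
the following state-space model $\mathcal{W}_i = \mathcal{F}\,\mathcal{W}_{i-1} + \mathcal{Y}_i$, where

$$\mathcal{W}_i \triangleq \begin{bmatrix} E\|\tilde{w}_i\|^2_{\sigma} \\ E\|\tilde{w}_i\|^2_{F\sigma} \\ E\|\tilde{w}_i\|^2_{F^2\sigma} \\ \vdots \\ E\|\tilde{w}_i\|^2_{F^{M^2-1}\sigma} \end{bmatrix}, \qquad \mathcal{F} \triangleq \begin{bmatrix} 0 & 1 & & & \\ 0 & 0 & 1 & & \\ \vdots & & & \ddots & \\ 0 & 0 & 0 & \cdots & 1 \\ -p_0 & -p_1 & -p_2 & \cdots & -p_{M^2-1} \end{bmatrix}_{M^2 \times M^2}$$

with

$$p(x) \triangleq \det(xI - F) = x^{M^2} + \sum_{k=0}^{M^2-1} p_k x^k$$

denoting the characteristic polynomial of $F$. Moreover,

$$\mathcal{Y}_i = \mu^2\sigma_v^2 \begin{bmatrix} E\left(\|u_i\|^2_{\sigma}/g^2[u_i]\right) \\ E\left(\|u_i\|^2_{F\sigma}/g^2[u_i]\right) \\ \vdots \\ E\left(\|u_i\|^2_{F^{M^2-1}\sigma}/g^2[u_i]\right) \end{bmatrix} + \alpha\mu \begin{bmatrix} \|w^o\|^2_{(\alpha\mu I + S_{i-1})\sigma} \\ \|w^o\|^2_{(\alpha\mu I + S_{i-1})F\sigma} \\ \vdots \\ \|w^o\|^2_{(\alpha\mu I + S_{i-1})F^{M^2-1}\sigma} \end{bmatrix}$$

where $S_i$ is the $M^2 \times M^2$ matrix $S_i = ((JC_i)^T \otimes I_M) + (I_M \otimes C_iJ)$.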

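(d) Verify that $F$ can be expressed as $F = I - \mu A + \mu^2 B$, where $B = E\,(U_i^T \otimes U_i)$ and
$A = (E\,(U_i)^T \otimes I) + (I \otimes E\,(U_i))$. Show that the adaptive filter is mean-square stable for
step-sizes in the range

$$0 < \mu < \min\left\{ \frac{2}{\alpha + \lambda_{\max}\left(E\left[\dfrac{u_i^*u_i}{g[u_i]}\right]\right)},\ \frac{1}{\lambda_{\max}(A^{-1}B)},\ \frac{1}{\max\{\lambda(H) \in \mathbb{R}^+\}} \right\}$$

where

$$H \triangleq \begin{bmatrix} A/2 & -B/2 \\ I_{M^2} & 0 \end{bmatrix}$$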


(e) How do the results of parts (a)-(d) simplify when the data are real-valued? 
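The linear relation of part (a) rests on the Kronecker identity $\mathrm{vec}(A\Sigma B) = (B^T \otimes A)\,\mathrm{vec}(\Sigma)$. The following sketch checks it numerically for a single realization of $U_i$ with $g[u_i] = 1$; the dimension, step-size, leakage value, and random test matrices are hypothetical, and no expectation is taken (the identity holds realization by realization, so it carries over to $F$ by linearity).

```python
import numpy as np

rng = np.random.default_rng(2)
M, mu, alpha = 3, 0.1, 0.05

# One complex row regressor and the matrix U_i = alpha*I + u_i^* u_i (g[u_i] = 1)
u = rng.standard_normal((1, M)) + 1j * rng.standard_normal((1, M))
U_i = alpha * np.eye(M) + u.conj().T @ u
G = np.eye(M) - mu * U_i                      # I - mu*U_i (Hermitian)

# An arbitrary Hermitian positive-definite weighting matrix Sigma
X = rng.standard_normal((M, M))
Sigma = X @ X.T + np.eye(M)

def vec(A):
    """Stack the columns of A on top of each other."""
    return A.reshape(-1, order="F")

sigma_prime_direct = vec(G @ Sigma @ G)       # vec of (I - mu U_i) Sigma (I - mu U_i)
sigma_prime_kron = np.kron(G.T, G) @ vec(Sigma)
```

Note that the Kronecker factor involves the plain transpose of $I - \mu U_i$, not its conjugate transpose, which is why the definition of $F$ carries a superscript $T$.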


Problem V.29 (Mean-square performance of leaky adaptive filters) Consider the setting of
Probs. V.27 and V.28. We follow the discussion that led to the MSD and EMSE expressions in
Thm. 24.1 to derive similar expressions for the mean-square performance of leaky data-normalized
adaptive filters.

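(a) Using the fact that in steady-state,

$$\lim_{i\to\infty} E\|\tilde{w}_i\|^2_\Sigma = \lim_{i\to\infty} E\|\tilde{w}_{i-1}\|^2_\Sigma, \qquad \lim_{i\to\infty} E\,\tilde{w}_i = \lim_{i\to\infty} E\,\tilde{w}_{i-1}$$

conclude that

$$\lim_{i\to\infty} E\,\tilde{w}_i = \alpha\mu(I - J)^{-1} w^o, \qquad \lim_{i\to\infty} E\|\tilde{w}_i\|^2_{(I-F)\sigma} = \mu^2\sigma_v^2\,E\left(\frac{\|u_i\|^2_\sigma}{g^2[u_i]}\right) + \alpha^2\mu^2\|w^o\|^2_{T\sigma}$$

where $T$ is $M^2 \times M^2$ and given by $T \triangleq I + \left(((I - J)^{-1}J)^T \otimes I\right) + \left(I \otimes (I - J)^{-1}J\right)$.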

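(b) Use part (a) to establish the following expressions for the MSD and EMSE of leaky data-
normalized adaptive filters:

$$\mathrm{MSD} = \mu^2\sigma_v^2\,E\left(\frac{\|u_i\|^2_{(I-F)^{-1}q}}{g^2[u_i]}\right) + \alpha^2\mu^2\|w^o\|^2_{T(I-F)^{-1}q}$$

$$\mathrm{EMSE} = \mu^2\sigma_v^2\,E\left(\frac{\|u_i\|^2_{(I-F)^{-1}r}}{g^2[u_i]}\right) + \alpha^2\mu^2\|w^o\|^2_{T(I-F)^{-1}r}$$

where $q$ is the $M^2 \times 1$ vector $q = \mathrm{vec}\{I\}$ and $r = \mathrm{vec}\{R_u\}$ is also $M^2 \times 1$.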


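(c) Verify that the expressions for the MSD and EMSE from part (b) can be rewritten as

$$\mathrm{MSD} = \mu^2\sigma_v^2\,\mathrm{Tr}(S\Sigma_{\mathrm{msd}}) + \alpha^2\mu^2\|w^o\|^2_{T\sigma_{\mathrm{msd}}}$$
$$\mathrm{EMSE} = \mu^2\sigma_v^2\,\mathrm{Tr}(S\Sigma_{\mathrm{emse}}) + \alpha^2\mu^2\|w^o\|^2_{T\sigma_{\mathrm{emse}}}$$

where $S = E\,(u_i^*u_i/g^2[u_i])$ and $\{\Sigma_{\mathrm{msd}}, \Sigma_{\mathrm{emse}}\}$ are the weighting matrices that correspond
to the vectors $\sigma_{\mathrm{msd}} = (I - F)^{-1}q$ and $\sigma_{\mathrm{emse}} = (I - F)^{-1}r$. That is, $\Sigma_{\mathrm{msd}} = \mathrm{vec}^{-1}(\sigma_{\mathrm{msd}})$
and $\Sigma_{\mathrm{emse}} = \mathrm{vec}^{-1}(\sigma_{\mathrm{emse}})$.
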
Problem V.30 (Leaky-LMS with white input) Consider the settings of Probs. V.27-V.29 specialized
to the case $g[u_i] = 1$, which corresponds to the leaky-LMS update $w_i = (1 - \alpha\mu)w_{i-1} +
\mu u_i^* e(i)$. In this problem we assume further that the entries of $u_i$ are i.i.d. with variance $\sigma_u^2$ and
fourth moment $\xi_u^4$, so that $R_u = \sigma_u^2 I$. Refer to the variance relation in part (d) of Prob. V.27 and set
$\Sigma = I$:


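(a) Let $\kappa^2 = \alpha^2 + 2\alpha\sigma_u^2 + \xi_u^4 + (M-1)\sigma_u^4$. Show that $\Sigma' = \left(1 - 2\mu(\alpha + \sigma_u^2) + \mu^2\kappa^2\right) I$, and
conclude that

$$E\|\tilde{w}_i\|^2 = \left(1 - 2\mu(\alpha+\sigma_u^2) + \mu^2\kappa^2\right) E\|\tilde{w}_{i-1}\|^2 + \alpha^2\mu^2\|w^o\|^2 + \mu^2\sigma_v^2\sigma_u^2 M + 2\alpha\mu\left[1 - \mu(\alpha+\sigma_u^2)\right]\mathrm{Re}\left\{w^{o*}E\,\tilde{w}_{i-1}\right\}$$

$$E\,\tilde{w}_i = \left(1 - \mu(\alpha + \sigma_u^2)\right)E\,\tilde{w}_{i-1} + \alpha\mu\, w^o$$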

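(b) Argue that in steady-state $\lim_{i\to\infty} E\,\tilde{w}_i = \dfrac{\alpha}{\alpha+\sigma_u^2}\, w^o$ and, therefore,

$$\mathrm{MSD} = \frac{\mu\sigma_v^2\sigma_u^2 M}{2(\alpha+\sigma_u^2) - \mu\kappa^2} + \frac{\alpha^2\left(2 - \mu(\alpha+\sigma_u^2)\right)}{(\alpha+\sigma_u^2)\left[2(\alpha+\sigma_u^2) - \mu\kappa^2\right]}\,\|w^o\|^2, \qquad \mathrm{EMSE} = \sigma_u^2 \cdot \mathrm{MSD}$$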


(c) Assume the individual entries of $u_i$ have a Gaussian distribution, so that $\xi_u^4 = 2\sigma_u^4$ (for
complex-valued random variables; $\xi_u^4 = 3\sigma_u^4$ for real-valued random variables). Evaluate the
MSD and EMSE in this case.
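The steady-state bias claimed in part (b) can be checked by simply iterating the mean recursion of part (a). The sketch below does so for hypothetical values of $\alpha$, $\mu$, $\sigma_u^2$, and $w^o$, starting from $E\,\tilde{w}_{-1} = w^o$ (i.e., $w_{-1} = 0$).

```python
import numpy as np

alpha, mu, sigma_u2 = 0.05, 0.01, 1.0         # leakage, step-size, regressor variance
w_o = np.array([1.0, -0.5, 0.25])             # hypothetical model vector w^o

mean_err = w_o.copy()                         # E w~_{-1} = w^o when w_{-1} = 0
for _ in range(5000):
    # E w~_i = (1 - mu(alpha + sigma_u^2)) E w~_{i-1} + alpha*mu*w^o
    mean_err = (1 - mu * (alpha + sigma_u2)) * mean_err + alpha * mu * w_o

bias_limit = alpha / (alpha + sigma_u2) * w_o  # predicted limit of E w~_i
```

The nonzero limit illustrates that leakage makes the filter biased: the mean weight error settles at a fraction $\alpha/(\alpha+\sigma_u^2)$ of $w^o$ rather than at zero.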


Problem V.31 (Leaky-LMS with Gaussian regressors) Consider the setting of Probs. V.27-V.29
specialized to the case $g[u_i] = 1$, which corresponds to the leaky-LMS update

$$w_i = (1 - \alpha\mu)\,w_{i-1} + \mu u_i^* e(i)$$

In this problem we assume further that the regressors $u_i$ are circular Gaussian, and follow the
arguments of Sec. 23.1 in order to evaluate the performance of leaky-LMS. Thus introduce the
eigen-decomposition $R_u = U\Lambda U^*$, where $\Lambda$ is diagonal with the eigenvalues of $R_u$ and $U$ is unitary.
Define further the transformed quantities:

$$\bar{w}_i \triangleq U^*\tilde{w}_i, \quad \bar{u}_i \triangleq u_iU, \quad \bar{\Sigma} \triangleq U^*\Sigma U, \quad \bar{w}^o \triangleq U^*w^o$$
$$\bar{U}_i \triangleq U^*U_iU, \quad \bar{J} \triangleq U^*JU, \quad \bar{C}_i \triangleq U^*C_iU$$

as well as $\Lambda^\alpha \triangleq \Lambda + \alpha I$ with diagonal entries $\{\lambda_k^\alpha = \lambda_k + \alpha\}$. The quantities $\{U_i, J, C_i\}$ are
defined as in Probs. V.27-V.28.


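(a) Verify that $E\,(\bar{U}_i\bar{\Sigma}\bar{U}_i) = \Lambda^\alpha\bar{\Sigma}\Lambda^\alpha + \Lambda\,\mathrm{Tr}(\bar{\Sigma}\Lambda)$. Now refer to the recursion for $\Sigma'$ from
part (d) of Prob. V.27 and show that it leads to

$$\bar{\Sigma}' = \bar{\Sigma} - \mu\Lambda^\alpha\bar{\Sigma} - \mu\bar{\Sigma}\Lambda^\alpha + \mu^2\left[\Lambda^\alpha\bar{\Sigma}\Lambda^\alpha + \Lambda\,\mathrm{Tr}(\bar{\Sigma}\Lambda)\right]$$

Verify that $\bar{\Sigma}'$ will be diagonal if $\bar{\Sigma}$ is.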


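(b) Let $\bar{\sigma} = \mathrm{diag}\{\bar{\Sigma}\}$ denote the $M \times 1$ vector with the diagonal entries of $\bar{\Sigma}$. Likewise, let
$\lambda = \mathrm{diag}\{\Lambda\}$ and $\lambda^\alpha = \mathrm{diag}\{\Lambda^\alpha\}$. Show that the above relation for $\bar{\Sigma}'$ leads to $\bar{\sigma}' = \bar{F}\bar{\sigma}$,
where $\bar{F}$ now denotes the $M \times M$ matrix (compare with the expression for $\bar{F}$ in (23.22) for
the non-leaky LMS case):

$$\bar{F} \triangleq \left(I - 2\mu\Lambda^\alpha + \mu^2(\Lambda^\alpha)^2\right) + \mu^2\lambda\lambda^T$$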


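(c) Verify that the matrices $\bar{J}$ and $\bar{C}_i$ are diagonal. In particular,

$$\bar{J} = E\,(I - \mu\bar{U}_i) = I - \mu\Lambda^\alpha$$

Conclude that the variance relation from part (d) of Prob. V.27 leads to

$$E\|\bar{w}_i\|^2_{\bar{\sigma}} = E\|\bar{w}_{i-1}\|^2_{\bar{F}\bar{\sigma}} + \mu^2\sigma_v^2(\lambda^T\bar{\sigma}) + \alpha\mu\,\|\bar{w}^o\|^2_{(\alpha\mu I + 2\bar{J}\bar{C}_{i-1})\bar{\sigma}}$$
$$E\,\bar{w}_i = \bar{J}\,E\,\bar{w}_{i-1} + \alpha\mu\,\bar{w}^o$$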


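(d) Show that the transient behavior of leaky-LMS is characterized by the following state-space
model of dimension $M$:

$$\mathcal{W}_i = \mathcal{F}\,\mathcal{W}_{i-1} + \mathcal{Y}_i$$

where

$$\mathcal{W}_i \triangleq \begin{bmatrix} E\|\bar{w}_i\|^2_{\bar{\sigma}} \\ E\|\bar{w}_i\|^2_{\bar{F}\bar{\sigma}} \\ E\|\bar{w}_i\|^2_{\bar{F}^2\bar{\sigma}} \\ \vdots \\ E\|\bar{w}_i\|^2_{\bar{F}^{M-1}\bar{\sigma}} \end{bmatrix}, \qquad \mathcal{F} \triangleq \begin{bmatrix} 0 & 1 & & & \\ 0 & 0 & 1 & & \\ \vdots & & & \ddots & \\ 0 & 0 & 0 & \cdots & 1 \\ -p_0 & -p_1 & -p_2 & \cdots & -p_{M-1} \end{bmatrix}_{M \times M}$$

with

$$p(x) \triangleq \det(xI - \bar{F}) = x^M + \sum_{k=0}^{M-1} p_k x^k$$

denoting the characteristic polynomial of $\bar{F}$. Moreover,

$$\mathcal{Y}_i = \mu^2\sigma_v^2 \begin{bmatrix} \lambda^T\bar{\sigma} \\ \lambda^T\bar{F}\bar{\sigma} \\ \lambda^T\bar{F}^2\bar{\sigma} \\ \vdots \\ \lambda^T\bar{F}^{M-1}\bar{\sigma} \end{bmatrix} + \alpha\mu \begin{bmatrix} \|\bar{w}^o\|^2_{(\alpha\mu I + 2\bar{J}\bar{C}_{i-1})\bar{\sigma}} \\ \|\bar{w}^o\|^2_{(\alpha\mu I + 2\bar{J}\bar{C}_{i-1})\bar{F}\bar{\sigma}} \\ \|\bar{w}^o\|^2_{(\alpha\mu I + 2\bar{J}\bar{C}_{i-1})\bar{F}^2\bar{\sigma}} \\ \vdots \\ \|\bar{w}^o\|^2_{(\alpha\mu I + 2\bar{J}\bar{C}_{i-1})\bar{F}^{M-1}\bar{\sigma}} \end{bmatrix}$$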


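(e) Express $\bar{F}$ as $\bar{F} = I - \mu A + \mu^2 B$, where $A = 2\Lambda^\alpha$ and $B = (\Lambda^\alpha)^2 + \lambda\lambda^T$. Follow
the arguments that led to (23.41) to conclude that the leaky-LMS filter is mean-square stable
if, and only if, the step-size $\mu$ is chosen such that

$$\sum_{k=1}^{M} \frac{\mu\lambda_k^2}{\lambda_k^\alpha\,(2 - \mu\lambda_k^\alpha)} < 1$$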


Problem V.32 (Performance of leaky-LMS with Gaussian regressors) Consider the same 
setting as Prob. V.31. Here we wish to follow the discussion that led to the results of Thm. 23.3 in 
order to determine expressions for the MSD and EMSE of leaky-LMS under Gaussian regressors. 


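(a) Use the variance relation of part (c) of Prob. V.31 to verify that, in steady-state,

$$E\|\bar{w}_\infty\|^2 = \mu^2\sigma_v^2\,\lambda^T(I - \bar{F})^{-1}q + \alpha^2\mu^2\,\|\bar{w}^o\|^2_{(I + 2\bar{J}(I - \bar{J})^{-1})(I - \bar{F})^{-1}q}$$

where $q$ is the $M$-dimensional column vector with unit entries.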


(b) Show that

$$\lambda^T(I - \bar{F})^{-1}q = \frac{\lambda^T D^{-1} q}{1 - \mu^2\lambda^T D^{-1}\lambda}$$

where $D = 2\mu\Lambda^\alpha - \mu^2(\Lambda^\alpha)^2$. Conclude that

where $\bar{w}^o(k)$ denotes the $k$-th entry of $\bar{w}^o$.
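(c) In a similar vein, verify that

$$\mathrm{EMSE} = \mu^2\sigma_v^2\,\lambda^T(I - \bar{F})^{-1}\lambda + \alpha^2\mu^2\,\|\bar{w}^o\|^2_{(I + 2\bar{J}(I - \bar{J})^{-1})(I - \bar{F})^{-1}\lambda}$$

and simplify the expression to

$$\mathrm{EMSE} = \frac{\displaystyle\sum_{k=1}^{M} \frac{\mu\sigma_v^2\lambda_k^2 + \alpha^2\left(\dfrac{2}{\lambda_k^\alpha} - \mu\right)\lambda_k\,|\bar{w}^o(k)|^2}{\lambda_k^\alpha\,(2 - \mu\lambda_k^\alpha)}}{1 - \mu\displaystyle\sum_{k=1}^{M} \frac{\lambda_k^2}{\lambda_k^\alpha\,(2 - \mu\lambda_k^\alpha)}}$$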


Problem V.33 (Tracking performance of LMS) The energy conservation argument can also be 
used to study the transient performance of adaptive filters in nonstationary environments. We illus- 
trate this fact by considering the LMS algorithm 


$$w_i = w_{i-1} + \mu u_i^*[d(i) - u_i w_{i-1}]$$


where the data $\{d(i), u_i\}$ are now assumed to satisfy the nonstationary model (20.16) instead of the
stationary model (22.1), namely

(1) There exists a vector $w_i^o$ such that $d(i) = u_i w_i^o + v(i)$.
(2) The weight vector varies according to $w_i^o = w_{i-1}^o + q_i$.
(3) The noise sequence $\{v(i)\}$ is i.i.d. with variance $\sigma_v^2 = E\,|v(i)|^2$.
(4) The noise sequence $v(i)$ is independent of $u_j$ for all $i, j$.
(5) The sequence $q_i$ has covariance $Q$ and is independent of $\{v(j), u_j\}$ for all $i, j$.
(6) The initial conditions $\{w_{-1}, w_{-1}^o\}$ are independent of all $\{d(j), u_j, v(j), q_j\}$.
(7) The regressor covariance matrix is $R_u = E\,u_i^*u_i > 0$.
(8) The variables $\{v(i), u_i, q_i\}$ have zero means.


Assume in addition that the regressors $\{u_i\}$ satisfy the independence condition (22.23). The
arguments we employed in Sec. 22.3 and Chapter 24 can be extended in a rather straightforward manner.


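(a) Establish the following variance relation (which extends the result of Thm. 22.4 in the LMS
case to nonstationary models):

$$E\|\tilde{w}_i\|^2_\Sigma = E\|\tilde{w}_{i-1}\|^2_{\Sigma'} + \mu^2\sigma_v^2\,\mathrm{Tr}(R_u\Sigma) + \mathrm{Tr}(Q\Sigma)$$
$$\Sigma' = \Sigma - \mu\Sigma\,E\,(u_i^*u_i) - \mu E\,(u_i^*u_i)\,\Sigma + \mu^2 E\left(\|u_i\|^2_\Sigma\, u_i^*u_i\right)$$

We thus see that an additional driving term appears in the first recursion, namely, $\mathrm{Tr}(Q\Sigma) =
E\,\|q_i\|^2_\Sigma$.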


(b) Assume further that the regressors $\{u_i\}$ are circular Gaussian. Argue that the same condition
on the step-size $\mu$ from Thm. 23.2 is necessary and sufficient for mean-square stability of the
filter.


(c) Define $\bar{q}_i = U^*q_i$, whose covariance matrix is $\bar{Q} = U^*QU$. Extend the result of Thm. 23.3
to this case and show that the MSD and EMSE are given by


EMSE T 
A 
1-4 zx 
wos DO 3-3 
MSD = ——— —— 


at (Qdiag {4 }) 


+ H P 
L-A rA. 
MC 


M x 
_ EE NN 
E H2 uke 


(d) Assume $Q = \sigma_q^2 I$. Verify that the expressions of part (c) simplify to


M ,-152 
2—uAÀ, 


> A og tuo Ak 


$ plot 4 uo2 A, 
2Xk MAS 

k=1 k 

MSD = ————————— 


Ak 
"s HÈ um 


Problem V.34 (Tracking performance of leaky-LMS) The purpose of this problem is to extend
the results of Prob. V.33 to leaky-LMS,

$$w_i = (1 - \alpha\mu)\,w_{i-1} + \mu u_i^* e(i)$$

Thus assume that the data $\{d(i), u_i\}$ satisfy the nonstationary model of Prob. V.33 with the exception
of item (2), which is replaced by $w_i^o = w^o + q_i$ for some unknown vector $w^o$. Assume in addition
that the regressors $u_i$ satisfy the independence condition (22.23).

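(a) Verify that the variance relation of part (d) of Prob. V.27 should be modified as follows (using
$g[u_i] = 1$):

$$E\|\tilde{w}_i\|^2_\Sigma = E\|\tilde{w}_{i-1}\|^2_{\Sigma'} + \alpha^2\mu^2\|w^o\|^2_\Sigma + \mu^2\sigma_v^2\,E\|u_i\|^2_\Sigma + \mathrm{Tr}(Q\Sigma) + 2\alpha\mu\,\mathrm{Re}\left\{w^{o*}\Sigma J\,E\,\tilde{w}_{i-1}\right\}$$
$$E\,\tilde{w}_i = J\,E\,\tilde{w}_{i-1} + \alpha\mu\,w^o$$
$$\Sigma' = \Sigma - \mu(E\,U_i)\Sigma - \mu\Sigma(E\,U_i) + \mu^2 E\,(U_i\Sigma U_i)$$

That is, an additional term $\mathrm{Tr}(Q\Sigma)$ appears in the recursion for $E\|\tilde{w}_i\|^2_\Sigma$.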
(b) Assume further that the regressors u; are circular Gaussian and consider the transformations 
of Prob. V.31. Repeat the arguments of Prob. V.32 to conclude that 
E 


ood + 0? (s PA) 


M 
» XB - wae 
MSD = y 


M A2 
1-45 eg] 


A2 M X 
= —— ——— - 
B Li Bb 


ge) 
(c) Assume $Q = \sigma_q^2 I$. Verify that the expressions of part (b) simplify to


uk (Et E Mut de (aig -1) moo? 
e Xr — ug] 
MSD = M 
A2 
1- nl SERRE 
and 
x uo + uo$X + no?M. (zig - ) eq)? 
Š Xt - wre] 
EMSE = 


M A2 
1a BAD 


Problem V.35 (Mean performance of RLS) Consider the RLS algorithm

$$P_i = \lambda^{-1}\left[P_{i-1} - \frac{\lambda^{-1}P_{i-1}u_i^*u_iP_{i-1}}{1 + \lambda^{-1}u_iP_{i-1}u_i^*}\right]$$
$$w_i = w_{i-1} + P_iu_i^*[d(i) - u_iw_{i-1}], \quad i \geq 0$$

and recall from the discussion in Sec. 14.1, and especially from recursion (14.4), that

$$P_i^{-1} = \lambda P_{i-1}^{-1} + u_i^*u_i, \qquad P_{-1}^{-1} = \epsilon I$$

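(a) Let $\tilde{w}_i = w^o - w_i$ where we are assuming the data $\{d(i), u_i\}$ satisfy model (22.1). Show
that

$$P_i^{-1}\tilde{w}_i = \lambda P_{i-1}^{-1}\tilde{w}_{i-1} - u_i^*v(i)$$

and conclude that (assuming $w_{-1} = 0$)

$$P_i^{-1}\tilde{w}_i = \lambda^{i+1}\epsilon\,w^o - \sum_{j=0}^{i}\lambda^{i-j}u_j^*v(j)$$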
(b) Verify that

$$E\,\tilde{w}_i = \epsilon\lambda^{i+1}\,(E\,P_i)\,w^o$$

Assuming $0 < \lambda < 1$ and $E\,P_i \to P > 0$ as $i \to \infty$, conclude that $E\,\tilde{w}_i \to 0$ as $i \to \infty$. In other
words, conclude that the exponentially-weighted RLS algorithm is asymptotically unbiased.
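The RLS recursions above can be sketched as follows, together with a numerical check that the Riccati update for $P_i$ agrees with the inverse recursion $P_i^{-1} = \lambda P_{i-1}^{-1} + u_i^*u_i$. The function name `rls`, the initialization $P_{-1} = \epsilon^{-1}I$, and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def rls(U, d, lam=0.99, eps=1e-2):
    """Exponentially-weighted RLS; U[i] is the row regressor u_i."""
    N, M = U.shape
    w = np.zeros(M, dtype=complex)
    P = np.eye(M, dtype=complex) / eps               # P_{-1} = eps^{-1} I (assumed)
    for i in range(N):
        u = U[i][None, :]                            # 1 x M row vector
        Pu = P @ u.conj().T                          # P_{i-1} u_i^*
        gamma = 1.0 / (1.0 + (u @ Pu).real.item() / lam)
        P = (P - gamma * (Pu @ Pu.conj().T) / lam) / lam   # Riccati update for P_i
        e = d[i] - (u @ w).item()                    # a priori output error
        w = w + (P @ u.conj().T).ravel() * e
    return w, P

rng = np.random.default_rng(4)
M, N, lam, eps = 4, 200, 0.99, 1e-2
w_true = rng.standard_normal(M)
U = rng.standard_normal((N, M))
d = U @ w_true + 0.01 * rng.standard_normal(N)
w_hat, P_final = rls(U, d, lam, eps)

# Unrolled inverse recursion: lam^N eps I + sum_j lam^(N-1-j) u_j^* u_j
P_inv_direct = lam**N * eps * np.eye(M) + sum(
    lam ** (N - 1 - j) * np.outer(U[j].conj(), U[j]) for j in range(N)
)
```

The term `lam**N * eps` in the last expression is the contribution of the initial condition; it decays geometrically, which is the mechanism behind the asymptotic unbiasedness asked for in part (b).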


Problem V.36 (Mean-square performance of RLS) The class of adaptive filters with matrix 
data nonlinearities (as described in Prob. V.22) can be generalized to include the RLS algorithm as 
well: 


$$P_i = \lambda^{-1}\left[P_{i-1} - \frac{\lambda^{-1}P_{i-1}u_i^*u_iP_{i-1}}{1 + \lambda^{-1}u_iP_{i-1}u_i^*}\right]$$
$$w_i = w_{i-1} + P_iu_i^*[d(i) - u_iw_{i-1}], \quad i \geq 0$$


Thus assume that the data {d(i), ui} satisfy (22.1). Assume further that the regressors satisfy the 
independence assumption (22.23). 
In Prob. V.22 we studied the following class of algorithms: 


$$w_i = w_{i-1} + \mu H[u_i]\,u_i^*\,e(i), \quad i \geq 0$$

$$e(i) = d(i) - u_iw_{i-1}$$

where $H[\cdot]$ is a Hermitian positive-definite matrix function of $u_i$. If we set $\mu = 1$ and allow $H[\cdot]$ to be
dependent not only on $u_i$ but also on all prior regressors, then the RLS update becomes a special
case (since $P_i$ is determined by the regressors $\{u_k, k \leq i\}$ and we can take $H[\cdot] = P_i$).


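(a) Use arguments similar to Prob. V.22 to conclude that the following relation holds for RLS:

$$E\|\tilde{w}_i\|^2_\Sigma = E\left(\|\tilde{w}_{i-1}\|^2_{\Sigma'}\right) + \sigma_v^2\,E\|u_i\|^2_{P_i\Sigma P_i}$$
$$\Sigma' = \Sigma - \Sigma P_iu_i^*u_i - u_i^*u_iP_i\Sigma + \|u_i\|^2_{P_i\Sigma P_i}\,u_i^*u_i$$

(b) In Chapter 19, while studying the mean-square performance of RLS, we argued that the
dependence of $P_i$ on present and past regressors makes the analysis challenging. For this reason,
whenever necessary, we replaced the random variables $\{P_i^{-1}, P_i\}$ by (cf. (19.11) and
(19.12)):

$$P_i^{-1} \approx E\,(P_i^{-1}) = \frac{R_u}{1-\lambda}, \qquad E\,P_i \approx (1-\lambda)R_u^{-1}$$

Use these substitutions to verify that the relation of part (a) becomes

$$E\|\tilde{w}_i\|^2_\Sigma = E\|\tilde{w}_{i-1}\|^2_{\Sigma'} + \sigma_v^2(1-\lambda)^2\,E\|u_i\|^2_{R_u^{-1}\Sigma R_u^{-1}}$$
$$\Sigma' = \Sigma - 2(1-\lambda)\Sigma + (1-\lambda)^2\,E\left[\|u_i\|^2_{R_u^{-1}\Sigma R_u^{-1}}\,u_i^*u_i\right]$$

(c) Assume now that the regressors $\{u_i\}$ are circular Gaussian. As in Sec. 23.1, introduce the
eigen-decomposition $R_u = U\Lambda U^*$, and define the transformed variables $\bar{w}_i = U^*\tilde{w}_i$, $\bar{u}_i =
u_iU$, and $\bar{\Sigma} \triangleq U^*\Sigma U$. Verify that

$$E\left[\|\bar{u}_i\|^2_{\Lambda^{-1}\bar{\Sigma}\Lambda^{-1}}\,\bar{u}_i^*\bar{u}_i\right] = \Lambda\,\mathrm{Tr}(\Lambda^{-1}\bar{\Sigma}) + \bar{\Sigma}$$

and conclude that the variance relation of part (b) becomes

$$E\|\bar{w}_i\|^2_{\bar{\Sigma}} = E\|\bar{w}_{i-1}\|^2_{\bar{\Sigma}'} + \sigma_v^2(1-\lambda)^2\,E\|\bar{u}_i\|^2_{\Lambda^{-1}\bar{\Sigma}\Lambda^{-1}}$$
$$\bar{\Sigma}' = \lambda^2\bar{\Sigma} + (1-\lambda)^2\,\Lambda\,\mathrm{Tr}(\Lambda^{-1}\bar{\Sigma})$$

Conclude that $\bar{\Sigma}'$ will be diagonal if $\bar{\Sigma}$ is diagonal.

(d) Define the $M$-dimensional vectors $b = \mathrm{diag}\{\Lambda\}$, $a = \mathrm{diag}\{\Lambda^{-1}\}$, and $\bar{\sigma} = \mathrm{diag}\{\bar{\Sigma}\}$.
Observe that, contrary to the notation in the previous problems and in the chapter, we are
denoting the vector corresponding to $\Lambda$ by $b$ in order to avoid confusion with the forgetting
factor $\lambda$ for RLS. Verify that the recursion for $\bar{\Sigma}$ reduces to

$$\bar{\sigma}' = \bar{F}\bar{\sigma}, \qquad \bar{F} \triangleq \lambda^2 I + (1-\lambda)^2\,ba^T$$

Let $q$ denote the column vector with unit entries. Show that

$$\mathrm{diag}\left\{(I - \bar{F})^{-1}q\right\} = \frac{1}{1-\lambda^2}\left[I + \frac{\mathrm{Tr}(\Lambda^{-1})\,\Lambda}{\dfrac{1+\lambda}{1-\lambda} - M}\right]$$

(e) Now choosing $\bar{\Sigma} = I$ and considering the steady-state value of the recursion for $E\|\bar{w}_i\|^2$,
conclude that

$$\mathrm{MSD} = \frac{\sigma_v^2\,\mathrm{Tr}(\Lambda^{-1})}{\dfrac{1+\lambda}{1-\lambda} - M}$$

Show also that

$$\mathrm{EMSE} = \frac{\sigma_v^2 M}{\dfrac{1+\lambda}{1-\lambda} - M}$$

Compare with the EMSE expression from Lemma 19.1.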
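The closed-form EMSE of part (e) can be cross-checked against a direct solution of the steady-state recursion of part (d). In the sketch below, the steady-state relation is taken as $E\|\bar{w}_\infty\|^2_{(I-\bar{F})\bar{\sigma}} = \sigma_v^2(1-\lambda)^2\,a^T\bar{\sigma}$ (using $E\|\bar{u}_i\|^2_{\Lambda^{-1}\bar{\Sigma}\Lambda^{-1}} = a^T\bar{\sigma}$ for diagonal $\bar{\Sigma}$), and the weighting is selected so that $(I-\bar{F})\bar{\sigma} = b$; the eigenvalues, forgetting factor, and noise variance are hypothetical.

```python
import numpy as np

lam_f, sigma_v2 = 0.98, 1e-3                  # forgetting factor, noise variance
eigs = np.array([1.0, 2.0, 3.5, 5.0])         # hypothetical eigenvalues of R_u
M = eigs.size
b, a = eigs, 1.0 / eigs                       # b = diag{Lambda}, a = diag{Lambda^{-1}}

# F = lam^2 I + (1 - lam)^2 b a^T from part (d)
F = lam_f**2 * np.eye(M) + (1 - lam_f) ** 2 * np.outer(b, a)

# EMSE: choose sigma so that (I - F) sigma = b, then evaluate the driving term
sigma = np.linalg.solve(np.eye(M) - F, b)
emse_from_recursion = sigma_v2 * (1 - lam_f) ** 2 * (a @ sigma)

# Closed form of part (e)
emse_closed_form = sigma_v2 * M / ((1 + lam_f) / (1 - lam_f) - M)
```

Because $a^Tb = M$ regardless of the eigenvalues, the EMSE does not depend on the eigenvalue spread of $R_u$ under this approximation, in contrast with the LMS expressions earlier in this part.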
(f) How do the EMSE and MSD expressions of part (e) simplify when $\lambda \to 1$?


Problem V.37 (Transient performance of $\epsilon$-APA) In this problem we study the transient
performance of the $\epsilon$-APA algorithm (21.53)-(21.55) with $K \leq M$. Thus introduce the weight-error
vector, and the weighted a priori and a posteriori error vectors, $\tilde{w}_i = w^o - w_i$, $e^\Sigma_{a,i} = U_i\Sigma\tilde{w}_{i-1}$
and $e^\Sigma_{p,i} = U_i\Sigma\tilde{w}_i$, where $\Sigma$ is any Hermitian positive-definite weighting matrix.


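(a) Verify that $\tilde{w}_i$ satisfies the recursion

$$\tilde{w}_i = \left(I - \mu U_i^*(\epsilon I + U_iU_i^*)^{-1}U_i\right)\tilde{w}_{i-1} - \mu U_i^*(\epsilon I + U_iU_i^*)^{-1}v_i$$

where $v_i = \mathrm{col}\{v(i), v(i-1), \ldots, v(i-K+1)\}$.

(b) Follow arguments similar to those in Sec. 22.3 to establish the following weighted energy-
conservation relation:

$$\|\tilde{w}_i\|^2_\Sigma + e^{\Sigma*}_{a,i}\,(U_i\Sigma U_i^*)^{-1}\,e^\Sigma_{a,i} = \|\tilde{w}_{i-1}\|^2_\Sigma + e^{\Sigma*}_{p,i}\,(U_i\Sigma U_i^*)^{-1}\,e^\Sigma_{p,i}$$

This relation is the extension of Thm. 22.1 to the context of affine projection algorithms.

(c) Follow further the arguments of Sec. 22.4, and ignore dependencies between $v_i$ and $\{e_{a,i}, e^\Sigma_{a,i}\}$
to establish the following variance relation, which extends the result of Thm. 22.3:

$$E\|\tilde{w}_i\|^2_\Sigma = E\left(\|\tilde{w}_{i-1}\|^2_{\Sigma'}\right) + \mu^2 E\,(v_i^*A_i^\Sigma v_i)$$

where $\Sigma'$ is the random matrix

$$\Sigma' = \Sigma - \mu\Sigma P_i - \mu P_i\Sigma + \mu^2\,(U_i^*A_i^\Sigma U_i)$$

and $\{P_i, A_i^\Sigma\}$ are defined by

$$P_i \triangleq U_i^*(\epsilon I + U_iU_i^*)^{-1}U_i, \qquad A_i^\Sigma \triangleq (\epsilon I + U_iU_i^*)^{-1}U_i\Sigma U_i^*(\epsilon I + U_iU_i^*)^{-1}$$

(d) If we were to extend the independence assumption (22.23) to the current context, we would
require the matrix sequence $\{U_i\}$ to be i.i.d. This assumption would then guarantee that
$\tilde{w}_{i-1}$ is independent of $\Sigma'$. However, requiring the sequence $\{U_i\}$ to be i.i.d. is a strong
condition (actually more so than the usual condition (22.23) since each $U_i$ consists of successive
regressors). Still, it can be seen from the expression for $\Sigma'$ that it is sufficient for
our purposes to require $\tilde{w}_{i-1}$ to be independent of $P_i$; this can be a weaker condition and
more likely to hold. As an illustration, consider the special case in which $U_i$ is square and
invertible with $\epsilon = 0$. In this case, $P_i = I$ and is independent of $\tilde{w}_{i-1}$. Use this assumption
to verify that the variance relation of part (c) becomes

$$E\|\tilde{w}_i\|^2_\Sigma = E\|\tilde{w}_{i-1}\|^2_{\Sigma'} + \mu^2 E\,(v_i^*A_i^\Sigma v_i)$$
$$\Sigma' = \Sigma - \mu\Sigma(E\,P_i) - \mu(E\,P_i)\Sigma + \mu^2 E\,(U_i^*A_i^\Sigma U_i)$$

where $\Sigma'$ is now a deterministic matrix. The above variance relation extends the result of
Thm. 22.4 to $\epsilon$-APA. Show further that the mean of the weight-error vector satisfies

$$E\,\tilde{w}_i = \left[I - \mu E\,(P_i)\right]E\,\tilde{w}_{i-1}$$

(e) As in Sec. 24.1, introduce the $\mathrm{vec}(\cdot)$ notation, $\sigma = \mathrm{vec}(\Sigma)$. Argue that the variance relation
of part (d) can be rewritten in the form

$$E\|\tilde{w}_i\|^2_\sigma = E\|\tilde{w}_{i-1}\|^2_{F\sigma} + \mu^2\sigma_v^2\,(\gamma^T\sigma)$$

where $\gamma = \mathrm{vec}\left(E\,U_i^*(\epsilon I + U_iU_i^*)^{-2}U_i\right)$ and

$$F \triangleq I - \mu\left[(E\,[P_i]^T \otimes I) + (I \otimes E\,[P_i])\right] + \mu^2 E\,[P_i^T \otimes P_i]$$
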

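(f) Conclude that the transient performance of $\epsilon$-APA is described by the $M^2$-dimensional state-
space recursion $\mathcal{W}_i = \mathcal{F}\,\mathcal{W}_{i-1} + \mu^2\sigma_v^2\,\mathcal{Y}$ where

$$\mathcal{F} \triangleq \begin{bmatrix} 0 & 1 & & & \\ 0 & 0 & 1 & & \\ \vdots & & & \ddots & \\ 0 & 0 & 0 & \cdots & 1 \\ -p_0 & -p_1 & -p_2 & \cdots & -p_{M^2-1} \end{bmatrix}_{M^2 \times M^2}, \qquad \mathcal{W}_i \triangleq \begin{bmatrix} E\|\tilde{w}_i\|^2_{\sigma} \\ E\|\tilde{w}_i\|^2_{F\sigma} \\ E\|\tilde{w}_i\|^2_{F^2\sigma} \\ \vdots \\ E\|\tilde{w}_i\|^2_{F^{M^2-1}\sigma} \end{bmatrix}$$

with

$$p(x) \triangleq \det(xI - F) = x^{M^2} + \sum_{k=0}^{M^2-1} p_k x^k$$

denoting the characteristic polynomial of $F$ and $[\mathcal{Y}]_k = \gamma^TF^k\sigma$ for $k = 0, \ldots, M^2 - 1$.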


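(g) Conclude further that $\epsilon$-APA is stable in the mean and mean-square senses for step-sizes in
the range

$$0 < \mu < \min\left\{ \frac{2}{\lambda_{\max}(E\,P_i)},\ \frac{1}{\lambda_{\max}(A^{-1}B)},\ \frac{1}{\max\{\lambda(H) \in \mathbb{R}^+\}} \right\}$$

where

$$A = (E\,[P_i]^T \otimes I) + (I \otimes E\,[P_i]), \qquad B = E\,[P_i^T \otimes P_i], \qquad H = \begin{bmatrix} A/2 & -B/2 \\ I_{M^2} & 0 \end{bmatrix}$$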


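(h) Deduce that the MSD and EMSE of $\epsilon$-APA are given by

$$\mathrm{MSD} = \mu^2\sigma_v^2\,\gamma^T(I - F)^{-1}\mathrm{vec}(I) \quad \text{and} \quad \mathrm{EMSE} = \mu^2\sigma_v^2\,\gamma^T(I - F)^{-1}\mathrm{vec}(R_u)$$

Remark. The results of this problem are based on the work by Shin and Sayed (2003a,b).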


Problem V.38 (Stable LMS with small step-size) Consider the setting of Thm. 24.1 where $F =
I_{M^2} - \mu A + \mu^2 B$, with

$$A = (I_M \otimes [E\,u_i^*u_i]) + ([E\,u_i^*u_i]^T \otimes I_M), \qquad B = E\left([u_i^*u_i]^T \otimes [u_i^*u_i]\right)$$


This theorem relates to the performance of LMS for non-Gaussian regressors. As explained in the 
arguments that led to the theorem, the key difficulty lies in evaluating the last moment in (23.6), 
which translates into a difficulty in evaluating the matrix B itself. The purpose of this problem is 
to show that, although we may not be able to evaluate B, if the distribution of u; is such that B is 
finite, then we can guarantee the existence of a small enough step-size for which F (and, hence, the 
filter) is stable. To establish the result we use the property of Kronecker products stated in Prob. V.6, 
as well as the characterization (25.29) for the maximum eigenvalue of Hermitian matrices. 


(a) Verify that the eigenvalues of $I_{M^2} - \mu A$ are given by $1 - \mu(\lambda_k + \lambda_j)$ for all $1 \leq j, k \leq M$.


(b) Assume that $B$ is finite. We mean that $B$ is finite relative to some matrix norm. One such
norm could be the Frobenius norm of $B$, which is defined as the Euclidean norm of $\mathrm{vec}(B)$.
Argue that the maximum eigenvalue of $F$ satisfies $\lambda_{\max}(F) \leq 1 - 2\mu\lambda_{\min} + \mu^2\beta$, for some
finite positive scalar $\beta$. Conclude that there exists a small enough $\mu$ such that $F$ is stable (and,
hence, the filter is mean-square stable).
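The eigenvalue structure in part (a) follows from the Kronecker-sum form of $A$ and can be verified numerically. The sketch below uses a hypothetical real symmetric positive-definite $R_u$ (so that $R_u^T = R_u$) and an arbitrary step-size; here $\{\lambda_k\}$ are taken to be the eigenvalues of $R_u$.

```python
import numpy as np

rng = np.random.default_rng(3)
M, mu = 4, 0.05

# Hypothetical real covariance matrix R_u > 0
X = rng.standard_normal((M, M))
R_u = X @ X.T + np.eye(M)

I_M = np.eye(M)
A = np.kron(I_M, R_u) + np.kron(R_u, I_M)     # A for real-valued data

lam = np.linalg.eigvalsh(R_u)                 # eigenvalues of R_u
predicted = np.sort([1 - mu * (lk + lj) for lk in lam for lj in lam])
computed = np.sort(np.linalg.eigvalsh(np.eye(M * M) - mu * A))
```

The largest eigenvalue of $I_{M^2} - \mu A$ is therefore $1 - 2\mu\lambda_{\min}$, which is the leading term in the bound of part (b).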
Problem V.39 (Stable data-normalized filters) Consider the setting of Thm. 25.2 where $F =
I_{M^2} - \mu A + \mu^2 B$, with

$$A = \left(E\left[\frac{u_i^*u_i}{g[u_i]}\right]^T \otimes I_M\right) + \left(I_M \otimes E\left[\frac{u_i^*u_i}{g[u_i]}\right]\right), \qquad B = E\left(\left[\frac{u_i^*u_i}{g[u_i]}\right]^T \otimes \left[\frac{u_i^*u_i}{g[u_i]}\right]\right)$$

Let $P = E\,u_i^*u_i/g[u_i]$. The matrix $P$ is clearly non-negative definite. Assume $P > 0$ so that its
minimum eigenvalue is positive. Assume further that $B$ is finite. Repeat the argument of Prob. V.38
to show that there exists a finite positive scalar $\beta$ such that $\lambda_{\max}(F) < 1$ for any $\mu < 2\lambda_{\min}(P)/\beta$.


COMPUTER PROJECT 


Project V.1 (Transient behavior of LMS) In this project we examine the transient behavior of 
LMS and verify some of the results derived in the chapter for both cases of Gaussian and non- 
Gaussian data. 

Thus consider a real-valued regression sequence $\{u_i\}$ with covariance matrix $R_u$ whose eigenvalue
spread we set at $\rho = 5$. Let the noise variance be $\sigma_v^2 = 0.001$ and fix the filter order at
$M = 5$.


(a) Generate a covariance matrix $R_u$ whose eigenvalue spread is 5. This can be achieved, for
example, by choosing $R_u$ to be diagonal with smallest eigenvalue at 1 and largest eigenvalue at
5. Generate independent and identically distributed regression vectors $\{u_i\}$ from a Gaussian
distribution with covariance matrix $R_u$. For each time instant $i$, generate also the reference
signal as follows:

$$d(i) = u_iw^o + v(i)$$

where $v(i)$ is white Gaussian noise with variance 0.001, and $w^o$ is some arbitrary weight
vector that we wish to estimate. Adjust the norm of $w^o$ to unity.


(b) Fix initially the step-size at $\mu = 0.01$ and train LMS for 10000 iterations. Average the
squared-weight-error curve, $\|\tilde{w}_i\|^2$, over 30 experiments and generate an ensemble-average
curve for $E\|\tilde{w}_i\|^2$. Use recursion (23.32) to generate the theoretical curve for $E\|\tilde{w}_i\|^2$. Observe
that since in this project we are choosing $R_u$ to be diagonal and, hence, $R_u = \Lambda$, it
holds that $\bar{w}_i = \tilde{w}_i$. Compare the simulated and theoretical curves. Use the simulated curve
to estimate the MSD. Compare this value with the one predicted by theory through expression
(23.54).


(c) Repeat the simulation of part (b) and generate a curve of the MSD performance as a function
of the step-size. Simulate for the values

$$\mu \in \{0.03,\ 0.05,\ 0.07,\ 0.08,\ 0.09,\ 0.095,\ 0.1,\ 0.125\}$$

At what value of the step-size does the filter become unstable? Now recall that according
to Thm. 23.2, the LMS filter remains mean-square stable for all step-sizes that satisfy the
condition

$$f(\mu) \triangleq \frac{1}{2}\sum_{k=1}^{M}\frac{\lambda_k\mu}{1 - \mu\lambda_k} < 1$$

Plot the function $f(\mu)$. At what value of $\mu$ does $f(\mu)$ exceed one? Compare this theoretical
value with the simulated value.


(d) Repeat the simulation of part (b) except that now the regression vectors $\{u_i\}$ are selected as
independent and identically distributed realizations of a uniform distribution with covariance
matrix $R_u$. More specifically, each $u_i$ is generated as follows. Select first random vectors $s_i$,
of the same size as $u_i$, but with i.i.d. entries that are uniformly distributed within the interval
$[-1/2, 1/2]$. In this way, each entry of $s_i$ has variance $1/12$. Then set

$$u_i = \sqrt{12}\cdot s_i\,R_u^{1/2}$$

where, since $R_u$ is diagonal, $R_u^{1/2}$ is the diagonal matrix with the positive square-roots of the
entries of $R_u$. Verify analytically that the vectors $u_i$ generated in this manner have covariance
matrix $R_u$.

Generate also a uniform noise sequence $v(i)$ with variance $\sigma_v^2 = 0.001$. Use the data
$\{d(i), u_i\}$ so obtained to simulate the operation of LMS with step-size $\mu = 0.05$. Use the
recursion in Prob. V.7 to generate the theoretical curve for $E\|\tilde{w}_i\|^2$. Compare the simulated
and theoretical curves. Construct also a curve for the MSD performance of the filter for the
same range of step-sizes as in part (c). At what value of the step-size does the filter become
unstable? Now recall that according to Thm. 24.1, the LMS filter remains mean-square stable
for step-sizes that satisfy condition (24.24), namely,

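$$0 < \mu < \min\left\{\frac{1}{\lambda_{\max}(A^{-1}B)},\ \frac{1}{\max\{\lambda(H) \in \mathbb{R}^+\}}\right\}$$

where $\{A, B\}$ are given by (24.21); for real-valued data, these matrices reduce to

$$A = (I_M \otimes R_u) + (R_u \otimes I_M), \qquad B = E\left([u_i^Tu_i] \otimes [u_i^Tu_i]\right)$$

and

$$H = \begin{bmatrix} A/2 & -B/2 \\ I_{M^2} & 0 \end{bmatrix}$$

Estimate the matrices $A$ and $B$ via ensemble-averaging and evaluate the upper bound on $\mu$
for mean-square stability. Compare this result with the one obtained from the simulated MSD
curve.
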
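As a starting point for part (b) of the project, the following sketch simulates LMS with real Gaussian regressors and a diagonal $R_u$, and compares the simulated steady-state MSD with a closed-form value. The closed form used here, $\mathrm{MSD} = \frac{\mu\sigma_v^2}{2}\sum_k \frac{1}{1-\mu\lambda_k}\big/\left(1 - f(\mu)\right)$ with $f(\mu) = \frac{1}{2}\sum_k \frac{\mu\lambda_k}{1-\mu\lambda_k}$, is an assumption derived for real-valued Gaussian data under the independence condition and should be checked against expression (23.54); the random seed and the equally spaced eigenvalues are likewise arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
M, mu, sigma_v2 = 5, 0.01, 1e-3
n_iter, n_runs = 10000, 30
lam = np.linspace(1.0, 5.0, M)                # diagonal R_u with eigenvalue spread 5
w_o = rng.standard_normal(M)
w_o /= np.linalg.norm(w_o)                    # unit-norm vector to be estimated

W = np.zeros((n_runs, M))                     # one weight vector (row) per experiment
msd_curve = np.empty(n_iter)
for i in range(n_iter):
    U = rng.standard_normal((n_runs, M)) * np.sqrt(lam)   # rows u_i with covariance R_u
    v = np.sqrt(sigma_v2) * rng.standard_normal(n_runs)
    d = U @ w_o + v                                        # d(i) = u_i w^o + v(i)
    e = d - np.sum(U * W, axis=1)                          # a priori errors, all runs
    W += mu * U * e[:, None]                               # LMS update (real data)
    msd_curve[i] = np.mean(np.sum((w_o - W) ** 2, axis=1)) # ensemble-average ||w~_i||^2

msd_sim = msd_curve[n_iter // 2:].mean()      # time-average over the steady-state half
f_mu = 0.5 * np.sum(mu * lam / (1 - mu * lam))
msd_theory = 0.5 * mu * sigma_v2 * np.sum(1 / (1 - mu * lam)) / (1 - f_mu)
```

Plotting `10 * np.log10(msd_curve)` against the iteration index gives the simulated learning curve; sweeping `mu` over the values of part (c) and monitoring whether `msd_curve` grows without bound gives the simulated stability limit.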
Transform Domain Adaptive Filters


The convergence performance of LMS-type filters is highly dependent on the correlation
of the input data and, in particular, on the eigenvalue spread of the covariance matrix of 
the regression data. In addition, the computational complexity of this class of filters is 
proportional to the filter length and, therefore, it can become prohibitive for long tapped 
delay lines. The purpose of this part is to describe three other classes of adaptive filters 
that address the two concerns of complexity and convergence, namely, transform-domain 
adaptive filters, block adaptive filters, and subband adaptive filters. 

Transform-domain filters exploit the de-correlation properties of some well-known sig- 
nal transforms, such as the discrete Fourier transform (DFT) and the discrete cosine trans- 
form (DCT), in order to pre-whiten the input data and speed up filter convergence. The 
resulting improvement in performance is usually a function of the data correlation and, 
therefore, the degree of success in achieving the desired objective varies from one signal
correlation to another. The computational cost continues to be $O(M)$ operations per
sample for a filter of length $M$.

Block adaptive filters, on the other hand, reduce the computational cost by a factor 
a > 1, while at the same time improving the convergence speed. This is achieved by 
processing the data on a block-by-block basis, as opposed to a sample-by-sample basis, 
and by exploiting the fact that many signal transforms admit efficient implementations. 
However, the reduction in cost and the improvement in convergence speed come at a price.
Block implementations tend to suffer from a delay problem in the signal path, and this
delay results from the need to collect blocks of data before processing.

The class of subband adaptive filters is related to the class of block adaptive filters, 
except that it attempts to achieve better pre-whitening (or band partitioning) of the data via 
selection of what are called prototype filters for their analysis and synthesis filter banks. 
While subband filters also succeed in reducing the computational cost by a factor a > 1, 
their convergence and mean-square performance can be less than that of block filters. This 
is because the design of the analysis and synthesis filter banks is usually decoupled from 
the adaptive filter design and, in this process, performance degradation can occur. 

In this part, we study these classes of algorithms in some detail and comment on some
of their additional properties. We start with transform-domain filters.


26.1 TRANSFORM-DOMAIN FILTERS 


Recall from the discussions in App. 23.A, Chapter 24, and also Prob. V.14, that the performance
of LMS is sensitive to the correlation of the input sequence $\{u(i)\}$ and, more
specifically, to the eigenvalues and eigenvalue spread of the covariance matrix

$$R_u = E\,u_i^*u_i$$

where

$$u_i = \begin{bmatrix} u(i) & u(i-1) & \ldots & u(i-M+1) \end{bmatrix}$$

denotes the regression vector. The smaller eigenvalues of $R_u$ contribute to slower convergence
and the larger eigenvalues limit the range of allowed step-sizes and, thereby, limit
the learning abilities of the filter. Best convergence and learning performance would result
when all the eigenvalues of $R_u$ are equal. This situation requires the input data to be white
so that $R_u$ would have the form $R_u = \sigma_u^2 I$ for some variance $\sigma_u^2$.

However, in general, the input data is colored and the eigenvalues of $R_u$ can vary significantly
in value from the smallest to the largest. One strategy to improve filter performance
is to attempt to pre-whiten the data prior to adaptation. If the auto-correlation of the input 
sequence is known, then we could use this information to construct a filter that pre-whitens 
the data, as we shall explain shortly. However, since the statistics of the input data are sel- 
dom known beforehand, the design of such pre-whitening filters is generally not possible. 
Still, there are ways to achieve this objective in an approximate manner. One such way 
is to transform the regressor prior to adaptation by some pre-selected unitary transforma- 
tion, such as the discrete Fourier transform (DFT) or the discrete cosine transform (DCT). 
These transforms have useful de-correlation properties that help improve the convergence 
performance of LMS for correlated input data. 
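The de-correlation effect can be illustrated numerically. The sketch below builds the covariance matrix of a first-order autoregressive-type input with $r(k) = a^{|k|}$, applies an orthonormal DCT to the regressor, and compares eigenvalue spreads. The correlation model, the values of $M$ and $a$, and the per-bin power normalization step (commonly combined with the transform in transform-domain LMS) are assumptions made for this illustration.

```python
import numpy as np

M, a = 16, 0.9
idx = np.arange(M)
R_u = a ** np.abs(idx[:, None] - idx[None, :])     # Toeplitz covariance, r(k) = a^{|k|}

# Orthonormal DCT-II matrix T: T[k, n] = c_k * sqrt(2/M) * cos(pi*(n + 1/2)*k / M)
T = np.sqrt(2.0 / M) * np.cos(np.pi * (idx[None, :] + 0.5) * idx[:, None] / M)
T[0, :] /= np.sqrt(2.0)

R_t = T @ R_u @ T.T                                # covariance of the transformed regressor
D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(R_t)))
R_n = D_inv_sqrt @ R_t @ D_inv_sqrt                # after normalizing each bin by its power

def eigenvalue_spread(R):
    """Ratio of the largest to the smallest eigenvalue of a symmetric matrix."""
    ev = np.linalg.eigvalsh(R)
    return ev[-1] / ev[0]

spread_before = eigenvalue_spread(R_u)
spread_after = eigenvalue_spread(R_n)
```

A unitary transformation alone leaves the eigenvalues unchanged; it is the combination of the transform (which concentrates the power on the diagonal) with the power normalization that reduces the spread.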


Pre-Whitening Filters 


We first explain how to pre-whiten the data when the second-order statistics of the input
sequence $\{u(i)\}$ are known. Thus, assume that $\{u(i)\}$ is zero-mean and wide-sense stationary,
with known auto-correlation function

$$r(k) = E\,u(i)u^*(i-k), \quad \text{for } k = 0, \pm 1, \pm 2, \ldots$$

In order to determine the pre-whitening filter from knowledge of $\{r(k)\}$, we need to explain
briefly the useful notions of power spectrum and spectral factorization.


Spectral Factorization 
The z-spectrum of a wide-sense stationary random process $\{u(i)\}$ is denoted by $S_u(z)$
and is defined as the z-transform of $\{r(k)\}$,

$$S_u(z) \triangleq \sum_{k=-\infty}^{\infty} r(k) z^{-k} \tag{26.1}$$

Of course, this definition makes sense only for those values of $z$ for which the series
converges. For our purposes, it suffices to assume that $\{r(k)\}$ is exponentially bounded, i.e.,

$$|r(k)| \le \beta a^{|k|} \tag{26.2}$$

for some $\beta > 0$ and $0 < a < 1$. In this case the series (26.1) is absolutely convergent for
all values of $z$ in the annulus $a < |z| < a^{-1}$, i.e., it satisfies

$$\sum_{k=-\infty}^{\infty} |r(k)| \cdot |z^{-k}| < \infty \quad \text{for all } a < |z| < 1/a$$

We then say that the interval $a < |z| < a^{-1}$ defines the region of convergence (ROC) of
$S_u(z)$. Since this ROC includes the unit circle, we establish that $S_u(z)$ cannot have poles
on the unit circle. Evaluating $S_u(z)$ on the unit circle then leads to what is called the power
spectrum (or the power-spectral-density function) of the random process $\{u(i)\}$:

$$S_u(e^{j\omega}) \triangleq \sum_{k=-\infty}^{\infty} r(k) e^{-j\omega k} \tag{26.3}$$


The power spectrum has two important properties: 
1. Hermitian symmetry, i.e., $S_u(e^{j\omega}) = [S_u(e^{j\omega})]^*$ and $S_u(e^{j\omega})$ is therefore real.
2. Nonnegativity on the unit circle, i.e., $S_u(e^{j\omega}) \ge 0$ for $0 \le \omega \le 2\pi$.


The first property is easily checked from the definition of $S_u(e^{j\omega})$ since

$$r(k) = r^*(-k)$$

Actually, and more generally, the z-spectrum satisfies the para-Hermitian symmetry property

$$S_u(z) = [S_u(1/z^*)]^* \tag{26.4}$$

That is, if we replace $z$ by $1/z^*$ and conjugate the result, then we recover $S_u(z)$ again. The
second claim regarding the nonnegativity of $S_u(e^{j\omega})$ on the unit circle is more demanding
to establish; it is proven in Prob. VI.2 under assumption (26.2).

To continue, we shall assume that $S_u(z)$ is a proper rational function and that it does
not have zeros on the unit circle so that

$$S_u(e^{j\omega}) > 0 \quad \text{for all } -\pi \le \omega \le \pi \tag{26.5}$$

Then using the para-Hermitian symmetry property (26.4), it is easy to see that for every
pole (or zero) at a point $\xi$, there must exist a pole (respectively a zero) at $1/\xi^*$. In addition,
it follows from the fact that $S_u(z)$ does not have poles and zeros on the unit circle that any
such rational function $S_u(z)$ can be factored as

$$S_u(z) = \sigma_u^2 \cdot \frac{\prod_{\ell=1}^{m} (z - z_\ell)(z^{-1} - z_\ell^*)}{\prod_{\ell=1}^{n} (z - p_\ell)(z^{-1} - p_\ell^*)} \tag{26.6}$$

for some positive scalar $\sigma_u^2$, and for poles and zeros satisfying $|z_\ell| < 1$ and $|p_\ell| < 1$. The
spectral factorization of $S_u(z)$ is now defined as a factorization of the form

$$S_u(z) = \sigma_u^2 A(z) [A(1/z^*)]^* \tag{26.7}$$

where $\{\sigma_u^2, A(z)\}$ satisfy the following conditions:
1. $\sigma_u^2$ is a positive scalar.
2. $A(z)$ is normalized to unity at infinity, i.e., $A(\infty) = 1$.
3. $A(z)$ is a rational minimum-phase function (i.e., its poles and zeros are inside the
unit circle).


The normalization $A(\infty) = 1$ makes the choice of $A(z)$ unique since otherwise infinitely
many choices for $\{\sigma_u^2, A(z)\}$ would exist. In order to determine $A(z)$ from (26.6) we just
have to extract the poles and zeros that are inside the unit circle. However, in order to meet
the normalization condition $A(\infty) = 1$, we take

$$A(z) = z^{n-m} \cdot \frac{\prod_{\ell=1}^{m} (z - z_\ell)}{\prod_{\ell=1}^{n} (z - p_\ell)} \tag{26.8}$$

FIGURE 26.1 Filtering of a wide-sense stationary random process $\{x(i)\}$ by a stable linear
system $H(z)$.


The spectral factor $A(z)$ so defined has a useful interpretation. To see this, we first
recall the following result. Let $\{x(i)\}$ be a wide-sense stationary random process with
z-spectrum $S_x(z)$, and assume it is fed into a stable linear system with transfer function
$H(z)$, as shown in Fig. 26.1. Then the output process $\{y(i)\}$ is also wide-sense stationary
and it can be verified, by direct calculation, that its z-spectrum is related to $\{S_x(z), H(z)\}$
via

$$S_y(z) = H(z) S_x(z) [H(1/z^*)]^*$$

Therefore, returning to $A(z)$, if we feed the process $\{u(i)\}$ through the filter $1/[\sigma_u A(z)]$,
as shown in Fig. 26.2, then the z-spectrum of the output process, denoted by $\{\bar{u}(i)\}$, will
be

$$S_{\bar{u}}(z) = \frac{1}{\sigma_u A(z)}\, S_u(z) \left[\frac{1}{\sigma_u A(1/z^*)}\right]^* = \frac{1}{\sigma_u A(z)} \Big[\sigma_u^2 A(z) [A(1/z^*)]^*\Big] \frac{1}{\sigma_u [A(1/z^*)]^*} = 1$$

In other words, the process $\{\bar{u}(i)\}$ becomes white with unit variance and, by inverse
z-transformation, its auto-correlation sequence will be $\bar{r}(k) = \mathsf{E}\,\bar{u}(i)\bar{u}^*(i-k) = \delta(k)$.


LMS Adaptation 

Consider now an LMS filter that is adapted by using $\{\bar{u}(i)\}$ as regression data, instead of
$\{u(i)\}$, as shown in Fig. 26.3, with the reference sequence $d(i)$ also filtered by $1/(\sigma_u A(z))$
to give $\bar{d}(i)$, i.e.,

$$\bar{w}_i = \bar{w}_{i-1} + \mu \bar{u}_i^* \bar{e}(i), \quad \bar{e}(i) = \bar{d}(i) - \bar{u}_i \bar{w}_{i-1}, \quad \bar{w}_{-1} = 0 \tag{26.9}$$

where

$$\bar{u}_i = \begin{bmatrix} \bar{u}(i) & \bar{u}(i-1) & \ldots & \bar{u}(i-M+1) \end{bmatrix}$$

denotes the regression data and $\bar{w}_i$ denotes the resulting weight vector. In Fig. 26.3, it is
assumed that the data $\{d(i), u_i\}$ satisfy the linear regression model $d(i) = u_i w^o + v(i)$,
for some unknown $w^o$. The covariance matrix of the transformed regressor is seen to be
$R_{\bar{u}} = \mathsf{E}\,\bar{u}_i^* \bar{u}_i = I$, with an eigenvalue spread of unity. In this way, the convergence
performance of the filter will be improved relative to an LMS implementation that relies
solely on $\{d(i), u(i)\}$. We illustrate this fact later in Fig. 26.5.

FIGURE 26.2 Pre-whitening of $\{u(i)\}$ by using the inverse of the spectral factor of $S_u(z)$.

FIGURE 26.3 An adaptive filter implementation with a pre-whitening filter.

As an example, consider a random process $\{u(i)\}$ with an exponential auto-correlation
sequence of the form

$$r(k) = a^{|k|}, \quad k = 0, \pm 1, \pm 2, \ldots \tag{26.10}$$

with $|a| < 1$. Its z-spectrum is given by

$$S_u(z) = \sum_{k=-\infty}^{\infty} a^{|k|} z^{-k} = \frac{1 - |a|^2}{(1 - a z^{-1})(1 - a^* z)}$$

and the corresponding spectral factor is therefore $A(z) = 1/(1 - a z^{-1})$ with
$\sigma_u^2 = 1 - |a|^2$. This result shows that a random process with exponential auto-correlation
function of the form (26.10) can be whitened by passing it through the first-order FIR filter

$$\frac{1 - a z^{-1}}{\sqrt{1 - |a|^2}}$$
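The whitening claim for the exponential auto-correlation (26.10) can be checked numerically. The sketch below is only illustrative: it assumes a real-valued $a$, generates the process as a first-order auto-regression driven by Gaussian noise (one convenient model with $r(k) = a^{|k|}$, not the only one), and the helper names `ar1_process`, `whiten_ar1`, and `sample_autocorr` are hypothetical.

```python
import numpy as np

def ar1_process(a, n, rng):
    """Generate n samples of a unit-variance AR(1) process with r(k) = a**|k| (real a)."""
    sigma = np.sqrt(1.0 - a**2)          # sigma_u from the spectral factorization
    w = rng.standard_normal(n)           # unit-variance white driving noise
    u = np.empty(n)
    u[0] = w[0]                          # stationary start: var u(0) = 1
    for i in range(1, n):
        u[i] = a * u[i - 1] + sigma * w[i]
    return u

def whiten_ar1(u, a):
    """Apply the FIR filter (1 - a z^-1) / sqrt(1 - a^2), taking u(-1) = 0."""
    sigma = np.sqrt(1.0 - a**2)
    u_prev = np.concatenate(([0.0], u[:-1]))
    return (u - a * u_prev) / sigma

def sample_autocorr(x, lag):
    """Biased sample estimate of E x(i) x(i - lag) for a real sequence."""
    n = len(x)
    return float(np.dot(x[lag:], x[:n - lag]) / n)

rng = np.random.default_rng(0)
a = 0.95
u = ar1_process(a, 200_000, rng)
u_bar = whiten_ar1(u, a)

r_u = [sample_autocorr(u, k) for k in range(3)]        # close to 1, a, a^2
r_bar = [sample_autocorr(u_bar, k) for k in range(3)]  # close to 1, 0, 0
print("input  r(0..2):", np.round(r_u, 3))
print("output r(0..2):", np.round(r_bar, 3))
```

The estimates are sample averages, so they only approach the theoretical values $a^{|k|}$ and $\delta(k)$ as the record length grows.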


Unitary Transformations 

Since the statistics of the input data are rarely known in advance, and since the data itself 
may not even be stationary, the design of a pre-whitening filter 1/A(z) is usually not pos- 
sible in the manner explained above. Still, there are ways to approximately pre-whiten the 
data, with varied degrees of success depending on the nature of the data. One such way is 
to transform the regressors prior to adaptation by some pre-selected unitary transformation, 
such as the DFT or the DCT, as we now explain. 

Consider the standard LMS implementation

$$w_i = w_{i-1} + \mu u_i^* [d(i) - u_i w_{i-1}] \tag{26.11}$$

with regressor $u_i$ and associated covariance matrix $R_u = \mathsf{E}\,u_i^* u_i$. Let $T$ denote an
arbitrary unitary matrix of size $M \times M$, i.e., $TT^* = T^*T = I$. For example, $T$ could be
chosen in terms of the discrete Fourier transform matrix (DFT),

$$[F]_{km} \triangleq \frac{1}{\sqrt{M}}\, e^{-\frac{j2\pi mk}{M}}, \quad k, m = 0, 1, \ldots, M-1 \quad \text{(DFT)} \tag{26.12}$$

or the discrete cosine transform matrix (DCT),

$$[C]_{km} \triangleq \alpha(k) \cos\left(\frac{k(2m+1)\pi}{2M}\right), \quad k, m = 0, 1, \ldots, M-1 \quad \text{(DCT)} \tag{26.13}$$

where

$$\alpha(0) = 1/\sqrt{M} \quad \text{and} \quad \alpha(k) = \sqrt{2/M}, \ \text{for } k \neq 0$$

Also, $k$ indicates the row index and $m$ the column index. The scaling factor $1/\sqrt{M}$ in
the expression (26.12) for the DFT matrix $F$ is added here in order to result in a unitary
transformation since then $F$ satisfies $FF^* = F^*F = I$. Moreover, there are other variants
of the discrete cosine transform; the one we choose is widely used. There are also discrete
sine transforms (DST). The arguments in this section apply to other unitary transforms as
well. Observe that $F$ is symmetric ($F^T = F$) while $C$ is real but nonsymmetric; yet both
are unitary. We usually choose $T = F$ or $T = C^T$, or some other unitary matrix. Once $T$
is selected, we then define the transformed regressor

$$\bar{u}_i = u_i T$$

whose covariance matrix is related to $R_u$ via

$$R_{\bar{u}} = \mathsf{E}\,\bar{u}_i^* \bar{u}_i = T^* (\mathsf{E}\,u_i^* u_i)\, T$$

that is,

$$R_{\bar{u}} = T^* R_u T \tag{26.14}$$

If we further let

$$\bar{w}_i = T^* w_i$$

and if we multiply both sides of (26.11) by $T^*$ from the left, then (26.11) can be rewritten
in terms of the transformed weight vector as (see Fig. 26.4):

$$\bar{w}_i = \bar{w}_{i-1} + \mu \bar{u}_i^* [d(i) - \bar{u}_i \bar{w}_{i-1}], \quad \bar{w}_{-1} = 0 \tag{26.15}$$

since $u_i w_{i-1} = \bar{u}_i \bar{w}_{i-1}$ in view of the unitarity of $T$. Observe that the reference sequence
$\{d(i)\}$ remains unchanged. Recursion (26.15) is a standard LMS implementation with
regression data $\bar{u}_i$ instead of $u_i$. However, since $T$ is unitary and therefore $T^* = T^{-1}$, the
relation between $\{R_u, R_{\bar{u}}\}$ has the form of a similarity transformation and such
transformations are known to preserve eigenvalues (see the footnote below). This means that
$\{R_u, R_{\bar{u}}\}$ will have the same eigenvalues and the same eigenvalue spread. Consequently,
the implementation (26.15) will face similar convergence limitations as the implementation
(26.11). More needs to be done in order to achieve improvement in performance.
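The invariance of the eigenvalue spread under a unitary transformation is easy to confirm numerically. The following sketch assumes the exponential covariance model (26.10) with a real $a$ and uses the unitary DFT matrix (26.12); the helper names `ar1_covariance`, `unitary_dft`, and `eigenvalue_spread` are hypothetical and chosen only for this illustration.

```python
import numpy as np

def ar1_covariance(a, M):
    """Covariance matrix with entries r(|m - n|) = a**|m - n| (real a assumed)."""
    idx = np.arange(M)
    return a ** np.abs(idx[:, None] - idx[None, :])

def unitary_dft(M):
    """Unitary DFT matrix with the 1/sqrt(M) scaling of (26.12)."""
    k = np.arange(M)
    return np.exp(-2j * np.pi * np.outer(k, k) / M) / np.sqrt(M)

def eigenvalue_spread(R):
    """Ratio of largest to smallest eigenvalue of a Hermitian matrix."""
    lam = np.linalg.eigvalsh(R)          # eigenvalues in ascending order
    return float(lam[-1] / lam[0])

M, a = 8, 0.95
R_u = ar1_covariance(a, M)
F = unitary_dft(M)
R_bar = F.conj().T @ R_u @ F             # similarity transformation (26.14) with T = F

spread_u = eigenvalue_spread(R_u)
spread_bar = eigenvalue_spread(R_bar)
print(spread_u, spread_bar)              # the two values agree up to rounding
```

The transformation rearranges the energy of $R_u$ toward its diagonal but, on its own, leaves the spread untouched; the normalization introduced next is what changes it.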


Footnote: Two square matrices $A$ and $B$ are said to be similar if they are related via $B = T^{-1}AT$ for some
invertible matrix $T$. Similarity transformations preserve eigenvalues and, hence, both $A$ and $B$ will have
the same eigenvalues (see Prob. VI.1).


FIGURE 26.4  Transform-domain adaptive filter implementation, where T is generally a unitary 
transformation. 


So assume that we replace the LMS update (26.15) by

$$\bar{w}_i = \bar{w}_{i-1} + \mu D^{-1} \bar{u}_i^* [d(i) - \bar{u}_i \bar{w}_{i-1}], \quad \bar{w}_{-1} = 0 \tag{26.16}$$

where, in addition to using the transformed regressor $\bar{u}_i$, we also employ a diagonal
normalization matrix $D$ with positive entries to be determined. We can regard (26.16) as an
LMS update with a matrix step-size that is equal to $\mu D^{-1}$. Let $D^{1/2}$ denote the diagonal
matrix whose entries are the positive square-roots of the entries of $D$. If we multiply both
sides of (26.16) by $D^{1/2}$ we find that it reduces to

$$w'_i = w'_{i-1} + \mu u'^{*}_i [d(i) - u'_i w'_{i-1}] \tag{26.17}$$

where

$$w'_i \triangleq D^{1/2} \bar{w}_i, \quad u'_i \triangleq \bar{u}_i D^{-1/2} = u_i T D^{-1/2}$$

Recursion (26.17) shows that the transform-domain algorithm (26.16) is equivalent to an
LMS implementation with regression data $\{u'_i\}$. Consequently, the performance of (26.16)
will be similar to that of an LMS filter with regression covariance matrix

$$R_{u'} = \mathsf{E}\,u'^{*}_i u'_i = D^{-1/2} T^* R_u T D^{-1/2} \tag{26.18}$$

The point to note now is that the relation between $R_{u'}$ and $R_u$ is not a similarity
transformation any longer and, therefore, their eigenvalues and eigenvalue spreads are generally
different. An ideal choice for $D$ would be to try to force $R_{u'}$ in (26.18) to become the
identity matrix (or a multiple of the identity).
Let $R_u = U \Lambda U^*$ denote the eigen-decomposition of $R_u$, and assume we choose $T$ and
$D$ as

$$T = U \quad \text{and} \quad D = \Lambda \tag{26.19}$$

Then $R_{\bar{u}} = \Lambda$ and $R_{u'} = I$, i.e., these choices of $T$ and $D$ de-correlate the entries of $\bar{u}_i$
and the variances of the individual entries of $\bar{u}_i$ are the $\{\lambda_k\}$, i.e., the eigenvalues of $R_u$.
This choice of $T$ is known as the Karhunen-Loève transform (KLT). However, using the
KLT is not practical since it requires knowledge of $R_u$ (in order to construct $T$ and $D$) and
this information is generally lacking in adaptive implementations.
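To make the effect of the ideal choice (26.19) concrete, the short sketch below builds a small covariance matrix, applies the KLT, and normalizes by the eigenvalues. It assumes the exponential model (26.10) with real $a$; in an actual adaptive setting $R_u$ would not be available, which is exactly the practical limitation noted above.

```python
import numpy as np

M, a = 4, 0.95
idx = np.arange(M)
R_u = a ** np.abs(idx[:, None] - idx[None, :])   # covariance with r(k) = a**|k|

lam, U = np.linalg.eigh(R_u)                     # R_u = U diag(lam) U^*
T = U                                            # KLT choice of (26.19)
D_inv_sqrt = np.diag(1.0 / np.sqrt(lam))         # D^{-1/2} with D = Lambda

R_bar = T.conj().T @ R_u @ T                     # (26.14): equals Lambda
R_prime = D_inv_sqrt @ R_bar @ D_inv_sqrt        # (26.18): equals the identity

print(np.round(R_bar, 6))
print(np.round(R_prime, 6))
```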
The alternative would be to replace $T$ by a unitary transformation that does not require
prior knowledge of $R_u$. We often choose $T$ as the DFT or DCT matrices, $T = F$ or
$T = C^T$ from (26.12)-(26.13). Of course, applying the DFT or the DCT to $u_i$ does not
generally result in a transformed regressor $\bar{u}_i$ with a diagonal covariance matrix $R_{\bar{u}}$. The
degree of success in transforming $R_u$ to close-to-diagonal by $F$ or $C^T$ is dependent on
$R_u$ itself; the transformations can be more successful in some cases and less successful in
others. In general, these transformations tend to result in covariance matrices $R_{\bar{u}}$ that are
reasonably close to diagonal, i.e.,

$$R_{\bar{u}} = T^* R_u T \approx \text{diagonal}$$

In this way, the entries of the resulting regressor $\bar{u}_i$ will be uncorrelated to a reasonable
extent, and their variances could be estimated recursively as follows:

$$\lambda_k(i) = \beta \lambda_k(i-1) + (1-\beta) |\bar{u}_i(k)|^2, \quad k = 0, 1, \ldots, M-1$$

using $0 \ll \beta < 1$ (e.g., $\beta = 0.9$), and where $\bar{u}_i(k)$ denotes the $k$-th entry of

$$\bar{u}_i = \begin{bmatrix} \bar{u}_i(0) & \bar{u}_i(1) & \ldots & \bar{u}_i(M-1) \end{bmatrix}$$

The estimators $\{\lambda_k(i)\}$ can in turn be used to construct an estimator for the step-size matrix
$D$ for iteration $i$ as $D_i = \text{diag}\{\lambda_k(i)\}$. Now if we write down (26.16) for each entry of
the weight vector, using the just constructed $D_i$, we get

$$e(i) = d(i) - \bar{u}_i \bar{w}_{i-1}$$

$$\bar{w}_i(k) = \bar{w}_{i-1}(k) + \frac{\mu}{\lambda_k(i)}\, \bar{u}_i^*(k)\, e(i), \quad k = 0, 1, \ldots, M-1$$

which can be seen to be an NLMS-type update with power normalization. In summary, we
arrive at the following statement; the result is stated in terms of realizations $\{d(i), u_i, \lambda_k(i)\}$
for the random variables $\{\boldsymbol{d}(i), \boldsymbol{u}_i, \boldsymbol{\lambda}_k(i)\}$.


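Algorithm 26.1 (General transform-domain LMS) Consider a zero-mean random
variable $\boldsymbol{d}$ with realizations $\{d(0), d(1), \ldots\}$, and a zero-mean random
row vector $\boldsymbol{u}$ with realizations $\{u_0, u_1, \ldots\}$. The weight vector $w^o$ that solves

$$\min_w \ \mathsf{E}\,|\boldsymbol{d} - \boldsymbol{u} w|^2$$

can be approximated iteratively via $w_i = T\bar{w}_i$, where $T$ is some pre-selected
unitary transformation and $\bar{w}_i$ is updated as follows. Start with $\lambda_k(-1) = \epsilon$
(a small positive number) and $\bar{w}_{-1} = 0$, and repeat for $i \ge 0$:

$$\begin{aligned}
\bar{u}_i &= u_i T = \begin{bmatrix} \bar{u}_i(0) & \bar{u}_i(1) & \ldots & \bar{u}_i(M-1) \end{bmatrix} \\
\bar{u}_i(k) &= k\text{-th entry of } \bar{u}_i \\
\lambda_k(i) &= \beta \lambda_k(i-1) + (1-\beta)|\bar{u}_i(k)|^2, \quad k = 0, 1, \ldots, M-1 \\
D_i &= \text{diag}\{\lambda_k(i)\} \\
e(i) &= d(i) - \bar{u}_i \bar{w}_{i-1} \\
\bar{w}_i &= \bar{w}_{i-1} + \mu D_i^{-1} \bar{u}_i^* e(i)
\end{aligned}$$

where $\mu$ is a positive step-size (usually small) and $0 \ll \beta < 1$.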
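The statement of Alg. 26.1 maps almost line by line onto code. The sketch below is one possible reading, not a reference implementation: the function name `transform_domain_lms`, the pre-windowing of the regressor with zeros, the default values of `mu`, `beta`, and `eps`, and the synthetic test setup (a hypothetical 4-tap model `w_o` driven by a first-order auto-regressive input) are all assumptions made for illustration.

```python
import numpy as np

def transform_domain_lms(d, u, T, mu=0.01, beta=0.9, eps=1e-3):
    """Run the recursions of Alg. 26.1 on sequences d(i), u(i); return (w, e)."""
    M = T.shape[0]
    w_bar = np.zeros(M, dtype=complex)       # transformed weights, w_bar_{-1} = 0
    lam = np.full(M, eps)                    # power estimates, lambda_k(-1) = eps
    reg = np.zeros(M, dtype=complex)         # regressor u_i, pre-windowed with zeros
    e = np.zeros(len(d), dtype=complex)
    for i in range(len(d)):
        reg = np.concatenate(([u[i]], reg[:-1]))       # [u(i), u(i-1), ..., u(i-M+1)]
        u_bar = reg @ T                                # transformed row regressor
        lam = beta * lam + (1 - beta) * np.abs(u_bar) ** 2
        e[i] = d[i] - u_bar @ w_bar                    # a priori output error
        w_bar = w_bar + mu * np.conj(u_bar) / lam * e[i]   # D_i^{-1} applied entrywise
    return T @ w_bar, e                                # w_i = T w_bar_i

# Synthetic identification problem (all values below are illustrative assumptions).
rng = np.random.default_rng(1)
M, a, N = 4, 0.95, 5000
w_o = np.array([1.0, -0.5, 0.25, 0.1])
x = rng.standard_normal(N)
u = np.zeros(N)
for i in range(1, N):
    u[i] = a * u[i - 1] + np.sqrt(1 - a**2) * x[i]     # correlated (colored) input
U = np.array([[u[i - m] if i >= m else 0.0 for m in range(M)] for i in range(N)])
d = U @ w_o + 1e-3 * rng.standard_normal(N)            # d(i) = u_i w^o + v(i)

k = np.arange(M)
F = np.exp(-2j * np.pi * np.outer(k, k) / M) / np.sqrt(M)   # unitary DFT (26.12)
w_hat, e = transform_domain_lms(d, u, F, mu=0.02)
print(np.round(w_hat.real, 3))
```

Passing `C.T` for the DCT matrix of (26.13) instead of `F` exercises the other transform discussed in the text; a complex data type is kept throughout so that the same routine serves both cases.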


Example 26.1 (Decorrelation properties) 


In order to illustrate the de-correlation properties of the DFT and DCT transforms, consider the same
random process $\{u(i)\}$ as before with the exponential auto-correlation sequence (26.10). Choose
$a = 0.95$ and $M = 4$. Then the covariance matrix $R_u$ is given by

$$R_u = \begin{bmatrix}
1.000000 & 0.950000 & 0.902500 & 0.857375 \\
0.950000 & 1.000000 & 0.950000 & 0.902500 \\
0.902500 & 0.950000 & 1.000000 & 0.950000 \\
0.857375 & 0.902500 & 0.950000 & 1.000000
\end{bmatrix}$$

Choosing $T = F$ and applying it to $R_u$ as indicated by (26.14), we obtain, after rounding to five
decimal places,

$$R_{\bar{u}} = F^* R_u F = \begin{bmatrix}
3.75619 & -0.02316(1+j) & 0 & -0.02316(1-j) \\
-0.02316(1-j) & 0.09750 & 0.02316(1+j) & j0.04631 \\
0 & 0.02316(1-j) & 0.04881 & 0.02316(1+j) \\
-0.02316(1+j) & -j0.04631 & 0.02316(1-j) & 0.09750
\end{bmatrix}$$

while choosing $T = C^T$ and applying it to $R_u$ leads to

$$R_{\bar{u}} = C R_u C^T = \begin{bmatrix}
3.75619 & 0 & -0.04631 & 0 \\
0 & 0.16265 & 0 & -0.00084 \\
-0.04631 & 0 & 0.05119 & 0 \\
0 & -0.00084 & 0 & 0.02998
\end{bmatrix}$$

We see that the resulting matrices $R_{\bar{u}}$ are diagonally dominant and that the DCT transformation is
more successful in transforming $R_u$ to a close-to-diagonal matrix. Further comparisons between
both transforms in the context of LMS adaptation are given later in Fig. 26.5.

◇
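The matrices of Example 26.1 can be regenerated with a few lines of NumPy, which is a convenient way to experiment with other values of $a$ and $M$. This is a sketch under the definitions (26.12)-(26.14); the helper `offdiag_energy_ratio` is a hypothetical measure (Frobenius norm of the off-diagonal part relative to the whole matrix) introduced here only to compare how close each result is to diagonal.

```python
import numpy as np

M, a = 4, 0.95
k = np.arange(M)
R_u = a ** np.abs(k[:, None] - k[None, :])                 # r(|m - n|) = a**|m - n|

F = np.exp(-2j * np.pi * np.outer(k, k) / M) / np.sqrt(M)  # unitary DFT (26.12)
alpha = np.where(k == 0, np.sqrt(1.0 / M), np.sqrt(2.0 / M))
C = alpha[:, None] * np.cos(np.outer(k, 2 * k + 1) * np.pi / (2 * M))  # DCT (26.13)

R_dft = F.conj().T @ R_u @ F       # T = F
R_dct = C @ R_u @ C.T              # T = C^T, so T^* R_u T = C R_u C^T

def offdiag_energy_ratio(R):
    """Frobenius norm of the off-diagonal part relative to that of R."""
    off = R - np.diag(np.diag(R))
    return float(np.linalg.norm(off) / np.linalg.norm(R))

np.set_printoptions(precision=5, suppress=True)
print(R_dft)
print(R_dct)
print(offdiag_energy_ratio(R_dft), offdiag_energy_ratio(R_dct))
```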


26.2 DFT-DOMAIN LMS 


The computational complexity of the transform-domain LMS filter of Alg. 26.1 depends
on the selection of $T$ and on the manner by which this transformation is implemented.
Assume, for instance, that $T$ is chosen as the DFT matrix (26.12). Even if the calculation
$\bar{u}_i = u_i T$ is performed using the fast Fourier transform (FFT), this step would require
$O(M \log_2 M)$ operations per iteration. This cost is higher than the usual $O(M)$ figure that
is required by the standard LMS implementation (26.11).

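Still, the transformed filter can be implemented at a cost of $O(M)$ operations per
iteration by exploiting the fact that two successive regressors $\{u_{i-1}, u_i\}$ share most of their
entries. Thus, consider the regressors $\{u_{i-1}, u_i\}$:

$$\begin{aligned}
u_{i-1} &= \begin{bmatrix} u(i-1) & u(i-2) & \ldots & u(i-M+1) & u(i-M) \end{bmatrix} \\
u_i &= \begin{bmatrix} u(i) & u(i-1) & \ldots & u(i-M+2) & u(i-M+1) \end{bmatrix}
\end{aligned}$$

and their transformed versions

$$\bar{u}_i = u_i T \quad \text{and} \quad \bar{u}_{i-1} = u_{i-1} T$$

We can evaluate $\bar{u}_i$ directly from $\bar{u}_{i-1}$ and from the entries $\{u(i), u(i-M)\}$ as follows.
The $k$-th elements of $\bar{u}_i$ and $\bar{u}_{i-1}$ are equal to the inner products between $u_i$ and $u_{i-1}$ and
the $k$-th column of $F$, respectively, and are therefore given by

$$\bar{u}_i(k) = \frac{1}{\sqrt{M}} \sum_{m=0}^{M-1} u(i-m)\, e^{-\frac{j2\pi mk}{M}}$$

and

$$\begin{aligned}
\bar{u}_{i-1}(k) &= \frac{1}{\sqrt{M}} \sum_{n=1}^{M} u(i-n)\, e^{-\frac{j2\pi(n-1)k}{M}} = \frac{1}{\sqrt{M}}\, e^{\frac{j2\pi k}{M}} \left( \sum_{n=1}^{M} u(i-n)\, e^{-\frac{j2\pi nk}{M}} \right) \\
&= e^{\frac{j2\pi k}{M}} \left( \frac{1}{\sqrt{M}} \sum_{n=1}^{M-1} u(i-n)\, e^{-\frac{j2\pi nk}{M}} \right) + \frac{1}{\sqrt{M}}\, e^{\frac{j2\pi k}{M}}\, u(i-M) \\
&= e^{\frac{j2\pi k}{M}} \left[ \bar{u}_i(k) - \frac{1}{\sqrt{M}}\, u(i) \right] + \frac{1}{\sqrt{M}}\, e^{\frac{j2\pi k}{M}}\, u(i-M)
\end{aligned}$$

so that

$$\bar{u}_i(k) = e^{-\frac{j2\pi k}{M}}\, \bar{u}_{i-1}(k) + [u(i) - u(i-M)]/\sqrt{M}$$

Collecting this result for $k = 0, 1, \ldots, M-1$ we arrive at the relation between $\{\bar{u}_i, \bar{u}_{i-1}\}$ as

$$\bar{u}_i = \bar{u}_{i-1} S + \frac{1}{\sqrt{M}} \{u(i) - u(i-M)\} \begin{bmatrix} 1 & 1 & \ldots & 1 \end{bmatrix} \tag{26.20}$$

where $S$ is the diagonal matrix defined in the statement below. Expression (26.20) allows
us to evaluate the transformed regressors iteratively at the cost of $O(M)$ operations per
iteration.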
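Relation (26.20) can be verified against the direct product $u_i F$. The sketch below assumes pre-windowed data ($u(i) = 0$ for $i < 0$) so that the recursion can start from $\bar{u}_{-1} = 0$; the function names are hypothetical. It is worth keeping in mind that the entries of $S$ have unit magnitude, so rounding errors in such a recursion are not damped over long runs.

```python
import numpy as np

def sliding_dft_regressors(u, M):
    """Compute u_bar_i = u_i F for all i via the O(M) recursion (26.20)."""
    k = np.arange(M)
    s = np.exp(-2j * np.pi * k / M)           # diagonal entries of S
    u_bar = np.zeros(M, dtype=complex)        # u_bar_{-1} = 0
    out = []
    for i in range(len(u)):
        u_old = u[i - M] if i >= M else 0.0   # u(i - M), zero before time 0
        u_bar = u_bar * s + (u[i] - u_old) / np.sqrt(M)
        out.append(u_bar)
    return np.array(out)

def direct_dft_regressors(u, M):
    """Compute u_bar_i = u_i F by forming each regressor explicitly."""
    k = np.arange(M)
    F = np.exp(-2j * np.pi * np.outer(k, k) / M) / np.sqrt(M)
    rows = []
    for i in range(len(u)):
        reg = np.array([u[i - m] if i >= m else 0.0 for m in range(M)])
        rows.append(reg @ F)
    return np.array(rows)

rng = np.random.default_rng(2)
u = rng.standard_normal(200)
M = 8
fast = sliding_dft_regressors(u, M)
slow = direct_dft_regressors(u, M)
max_err = float(np.max(np.abs(fast - slow)))
print(max_err)
```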


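Algorithm 26.2 (DFT-domain LMS) Consider the setting of Alg. 26.1. The
weight vector $w^o$ can be approximated iteratively via $w_i = F\bar{w}_i$, where $F$
is the unitary DFT matrix (26.12) and $\bar{w}_i$ is updated as follows. Define the
$M \times M$ diagonal matrix

$$S = \text{diag}\left\{1, e^{-\frac{j2\pi}{M}}, \ldots, e^{-\frac{j2\pi(M-1)}{M}}\right\}$$

Then start with $\lambda_k(-1) = \epsilon$ (a small positive number), $\bar{w}_{-1} = 0$, $\bar{u}_{-1} = 0$,
and repeat for $i \ge 0$:

$$\begin{aligned}
\bar{u}_i &= \bar{u}_{i-1} S + \frac{1}{\sqrt{M}} \{u(i) - u(i-M)\} \begin{bmatrix} 1 & 1 & \ldots & 1 \end{bmatrix} \\
\bar{u}_i(k) &= k\text{-th entry of } \bar{u}_i \\
\lambda_k(i) &= \beta \lambda_k(i-1) + (1-\beta)|\bar{u}_i(k)|^2, \quad k = 0, 1, \ldots, M-1 \\
D_i &= \text{diag}\{\lambda_k(i)\} \\
e(i) &= d(i) - \bar{u}_i \bar{w}_{i-1} \\
\bar{w}_i &= \bar{w}_{i-1} + \mu D_i^{-1} \bar{u}_i^* e(i)
\end{aligned}$$

where $\mu$ is a positive step-size (usually small) and $0 \ll \beta < 1$. The
computational cost of this algorithm is $O(M)$ operations per iteration.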


26.3 DCT-DOMAIN LMS 


A similar derivation can be carried out when the transformation $T$ is chosen as $T = C^T$,
where $C$ is the DCT matrix (26.13). Let

$$\bar{u}_i = u_i T, \quad \bar{u}_{i-1} = u_{i-1} T, \quad \bar{u}_{i-2} = u_{i-2} T$$

In contrast to the DFT case, it turns out that there is now a relation between three successive
transformed regressors. It is shown in App. 26.A that the following relation holds:

$$\bar{u}_i = \bar{u}_{i-1} S - \bar{u}_{i-2} + \begin{bmatrix} \phi(0) & \phi(1) & \ldots & \phi(M-1) \end{bmatrix} \tag{26.21}$$

where $S$ is a diagonal matrix and the $\{\phi(k)\}$ are scalars defined in the following statement.


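Algorithm 26.3 (DCT-domain LMS) Consider the setting of Alg. 26.1. The
weight vector $w^o$ can be approximated iteratively via $w_i = C^T \bar{w}_i$, where $C$
is the unitary DCT matrix (26.13) and $\bar{w}_i$ is updated as follows. Define the
$M \times M$ diagonal matrix

$$S = \text{diag}\{2\cos(k\pi/M), \ k = 0, 1, \ldots, M-1\}$$

Then start with $\lambda_k(-1) = \epsilon$ (a small positive number), $\bar{w}_{-1} = 0$, $\bar{u}_{-1} = \bar{u}_{-2} = 0$,
and repeat for $i \ge 0$:

$$\begin{aligned}
a(k) &= [u(i) - u(i-1)] \cos\left(\frac{k\pi}{2M}\right), \quad k = 0, 1, \ldots, M-1 \\
b(k) &= (-1)^k [u(i-M) - u(i-M-1)] \cos\left(\frac{k\pi}{2M}\right), \quad k = 0, 1, \ldots, M-1 \\
\phi(k) &= \alpha(k)[a(k) - b(k)], \quad k = 0, 1, \ldots, M-1 \\
\bar{u}_i &= \bar{u}_{i-1} S - \bar{u}_{i-2} + \begin{bmatrix} \phi(0) & \phi(1) & \ldots & \phi(M-1) \end{bmatrix} \\
\bar{u}_i(k) &= k\text{-th entry of } \bar{u}_i \\
\lambda_k(i) &= \beta \lambda_k(i-1) + (1-\beta)|\bar{u}_i(k)|^2, \quad k = 0, 1, \ldots, M-1 \\
D_i &= \text{diag}\{\lambda_k(i)\} \\
e(i) &= d(i) - \bar{u}_i \bar{w}_{i-1} \\
\bar{w}_i &= \bar{w}_{i-1} + \mu D_i^{-1} \bar{u}_i^* e(i)
\end{aligned}$$

where $\mu$ is a positive step-size (usually small) and $0 \ll \beta < 1$. The
computational cost of this algorithm is $O(M)$ operations per iteration.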
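The three-term recursion used in Alg. 26.3 can be checked against the direct product $u_i C^T$ in the same way as the DFT case. The sketch assumes real, pre-windowed data ($u(i) = 0$ for $i < 0$) and zero initial vectors $\bar{u}_{-1}$ and $\bar{u}_{-2}$; the function names are hypothetical.

```python
import numpy as np

def sliding_dct_regressors(u, M):
    """Compute u_bar_i = u_i C^T for all i via the three-term recursion (26.21)."""
    k = np.arange(M)
    alpha = np.where(k == 0, np.sqrt(1.0 / M), np.sqrt(2.0 / M))
    s = 2.0 * np.cos(k * np.pi / M)              # diagonal entries of S
    c = np.cos(k * np.pi / (2 * M))
    sign = (-1.0) ** k

    def sample(n):
        return u[n] if n >= 0 else 0.0           # pre-windowing: u(n) = 0 for n < 0

    prev1 = np.zeros(M)                          # u_bar_{i-1}
    prev2 = np.zeros(M)                          # u_bar_{i-2}
    out = []
    for i in range(len(u)):
        a_k = (sample(i) - sample(i - 1)) * c
        b_k = sign * (sample(i - M) - sample(i - M - 1)) * c
        cur = prev1 * s - prev2 + alpha * (a_k - b_k)   # phi(k) = alpha(k)[a(k) - b(k)]
        out.append(cur)
        prev2, prev1 = prev1, cur
    return np.array(out)

def direct_dct_regressors(u, M):
    """Compute u_bar_i = u_i C^T by forming each regressor explicitly."""
    k = np.arange(M)
    alpha = np.where(k == 0, np.sqrt(1.0 / M), np.sqrt(2.0 / M))
    C = alpha[:, None] * np.cos(np.outer(k, 2 * k + 1) * np.pi / (2 * M))
    rows = []
    for i in range(len(u)):
        reg = np.array([u[i - m] if i >= m else 0.0 for m in range(M)])
        rows.append(reg @ C.T)
    return np.array(rows)

rng = np.random.default_rng(5)
u = rng.standard_normal(200)
M = 8
fast = sliding_dct_regressors(u, M)
slow = direct_dct_regressors(u, M)
max_err = float(np.max(np.abs(fast - slow)))
print(max_err)
```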


Example 26.2 (Performance comparison) 


Figure 26.5 compares the performance of four LMS implementations for a first-order auto-regressive
process $u(i)$ with $a$ in (26.10) chosen as 0.95. The filter order is set to $M = 8$ and the
ensemble-average learning curves are generated by averaging over 300 experiments. The step-size is set to
$\mu = 0.01$ and the noise variance at $-40$ dB. It is seen from the figure that DCT-LMS and
DFT-LMS exhibit faster convergence than a standard LMS implementation. Moreover, in the figure, the
learning curve for LMS with a pre-whitening filter is evaluated by transforming the error signal $\bar{e}(i)$
of Fig. 26.3 to the $e(i)$ domain by filtering $\bar{e}(i)$ through $\sigma_u A(z)$.

[Plot: Learning curves of original LMS, pre-whitened LMS, DFT-LMS, and DCT-LMS; horizontal axis: Iteration (200 to 2000).]

FIGURE 26.5 A comparison of the learning curves of plain LMS, DFT-domain LMS, DCT-
domain LMS, and LMS with pre-whitening for a first-order auto-regressive input process.


26.A APPENDIX: DCT-TRANSFORMED REGRESSORS 


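To arrive at (26.21), we start by noting that the $k$-th element of $\bar{u}_i$ is equal to the inner product
between $u_i$ and the $k$-th column of $C^T$ and is therefore given by

$$\begin{aligned}
\bar{u}_i(k) &= \alpha(k) \sum_{m=0}^{M-1} u(i-m) \cos\left(\frac{k(2m+1)\pi}{2M}\right) \\
&= \alpha(k) \left[ u(i) \cos\left(\frac{k\pi}{2M}\right) + A \right] - \alpha(k)\, u(i-M) \cos\left(\frac{k(2M+1)\pi}{2M}\right)
\end{aligned}$$

where we defined

$$A \triangleq \sum_{m=1}^{M} u(i-m) \cos\left(\frac{k(2m+1)\pi}{2M}\right) = \frac{1}{2} \sum_{n=0}^{M-1} u(i-n-1) \left( e^{\frac{jk(2n+3)\pi}{2M}} + e^{-\frac{jk(2n+3)\pi}{2M}} \right)$$

and where we employed a change of variables from $m$ to $n = m - 1$. Now note that

$$A = \frac{1}{2}\, e^{\frac{j2k\pi}{2M}} \sum_{n=0}^{M-1} u(i-n-1)\, e^{\frac{jk(2n+1)\pi}{2M}} + \frac{1}{2}\, e^{-\frac{j2k\pi}{2M}} \sum_{n=0}^{M-1} u(i-n-1)\, e^{-\frac{jk(2n+1)\pi}{2M}}$$

so that

$$\begin{aligned}
A = {} & 2\cos\left(\frac{2k\pi}{2M}\right) \sum_{n=0}^{M-1} u(i-n-1) \cos\left(\frac{k(2n+1)\pi}{2M}\right) \\
& - \frac{1}{2}\, e^{-\frac{j2k\pi}{2M}} \sum_{n=0}^{M-1} u(i-n-1)\, e^{\frac{jk(2n+1)\pi}{2M}} - \frac{1}{2}\, e^{\frac{j2k\pi}{2M}} \sum_{n=0}^{M-1} u(i-n-1)\, e^{-\frac{jk(2n+1)\pi}{2M}}
\end{aligned}$$

which gives

$$\begin{aligned}
A &= \frac{2}{\alpha(k)} \cos\left(\frac{k\pi}{M}\right) \bar{u}_{i-1}(k) - \frac{1}{2} \sum_{n=0}^{M-1} u(i-n-1) \left[ e^{\frac{jk(2n-1)\pi}{2M}} + e^{-\frac{jk(2n-1)\pi}{2M}} \right] \\
&= \frac{2}{\alpha(k)} \cos\left(\frac{k\pi}{M}\right) \bar{u}_{i-1}(k) - \sum_{n=0}^{M-1} u(i-n-1) \cos\left(\frac{k(2n-1)\pi}{2M}\right) \\
&= \frac{2}{\alpha(k)} \cos\left(\frac{k\pi}{M}\right) \bar{u}_{i-1}(k) - \left[ \frac{1}{\alpha(k)}\, \bar{u}_{i-2}(k) + u(i-1) \cos\left(\frac{k\pi}{2M}\right) - u(i-M-1) \cos\left(\frac{k(2M-1)\pi}{2M}\right) \right]
\end{aligned}$$

In summary, we arrive at the relation

$$\bar{u}_i(k) = 2\cos\left(\frac{k\pi}{M}\right) \bar{u}_{i-1}(k) - \bar{u}_{i-2}(k) + \phi(k)$$

where

$$\begin{aligned}
\phi(k) &\triangleq \alpha(k)[a(k) - b(k)] \\
a(k) &\triangleq [u(i) - u(i-1)] \cos\left(\frac{k\pi}{2M}\right) \\
b(k) &\triangleq (-1)^k [u(i-M) - u(i-M-1)] \cos\left(\frac{k\pi}{2M}\right)
\end{aligned}$$

and where we used the fact that

$$\cos\left(\frac{k(2M+1)\pi}{2M}\right) = \cos\left(\frac{k(2M-1)\pi}{2M}\right) = (-1)^k \cos\left(\frac{k\pi}{2M}\right)$$

Collecting this result for $k = 0, 1, \ldots, M-1$ we arrive at the desired relation (26.21).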
Efficient Block Convolution


The transform-domain adaptive filters of the previous chapter were motivated by the 
desire to improve the convergence performance of LMS by exploiting the de-correlation 
properties of unitary transforms such as the DFT and the DCT. It turns out that these 
same transforms, and other similar ones, are also useful in reducing the computational 
cost per iteration of LMS below the O(M) figure. This cost reduction can be achieved by 
processing the data on a block-by-block basis rather than on a sample-by-sample basis. 


27.1 MOTIVATION 


As motivation, consider the setting of Fig. 27.1, which shows an FIR channel of length
$M$, assumed long. The channel is excited by a zero-mean random sequence $\{\boldsymbol{u}(i)\}$ and its
output is another zero-mean random sequence $\{\boldsymbol{y}(i)\}$. At any particular time instant $i$, the
state of the channel is captured by the regression vector

$$\boldsymbol{u}_i = \begin{bmatrix} \boldsymbol{u}(i) & \boldsymbol{u}(i-1) & \boldsymbol{u}(i-2) & \ldots & \boldsymbol{u}(i-M+1) \end{bmatrix}$$

and its output is measured in the presence of noise,

$$\boldsymbol{d}(i) = \boldsymbol{u}_i g + \boldsymbol{v}(i) \tag{27.1}$$

where the column vector $g$ represents the channel impulse response, and $\boldsymbol{v}(i)$ is a
zero-mean noise sequence uncorrelated with $\boldsymbol{u}_i$. Let $G(z)$ denote the transfer function
associated with $g$, i.e.,

$$G(z) \triangleq g(0) + g(1)z^{-1} + g(2)z^{-2} + \ldots + g(M-1)z^{-M+1} = \sum_{k=0}^{M-1} g(k) z^{-k} \tag{27.2}$$

where the $\{g(k)\}$ are the individual samples of $g$. Let also $\{u_i, d(i), y(i)\}$ denote observed
values for the random variables $\{\boldsymbol{u}_i, \boldsymbol{d}(i), \boldsymbol{y}(i)\}$.

An LMS adaptive implementation for estimating $g$ is depicted in Fig. 27.2, with
adaptation equations given by

$$\hat{d}(i) = u_i w_{i-1}, \quad e(i) = d(i) - \hat{d}(i), \quad w_i = w_{i-1} + \mu u_i^* e(i) \tag{27.3}$$

This implementation requires $O(M)$ operations per iteration and when $M$ is large, the cost
can be prohibitive. For example, in acoustic echo cancellation, a few thousand taps may be
needed to adequately model the echo path. In such situations, we must seek more efficient
adaptive implementations.
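For reference, the sample-by-sample recursion (27.3) can be written out as follows. This is a minimal sketch for real-valued data: the function name `lms`, the channel length, the white Gaussian input, and the noise level are illustrative assumptions rather than settings taken from the text. The inner products inside the loop are what account for the $O(M)$ cost per iteration.

```python
import numpy as np

def lms(d, u, M, mu):
    """Sample-by-sample LMS of (27.3) for real data; returns (w, e)."""
    w = np.zeros(M)                      # w_{-1} = 0
    reg = np.zeros(M)                    # regressor u_i, pre-windowed with zeros
    e = np.zeros(len(d))
    for i in range(len(d)):
        reg = np.concatenate(([u[i]], reg[:-1]))   # [u(i), ..., u(i-M+1)]
        d_hat = reg @ w                            # O(M) inner product
        e[i] = d[i] - d_hat
        w = w + mu * reg * e[i]                    # real data: u_i^* is u_i transposed
    return w, e

# Hypothetical channel and data, chosen only to exercise the recursion.
rng = np.random.default_rng(3)
M, N = 16, 4000
g = rng.standard_normal(M) / np.sqrt(M)            # unknown channel impulse response
u = rng.standard_normal(N)                         # white input sequence
y = np.convolve(u, g)[:N]                          # y(i) = u_i g
d = y + 1e-3 * rng.standard_normal(N)              # noisy measurement (27.1)
w, e = lms(d, u, M, mu=0.02)
print(np.linalg.norm(w - g))
```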
FIGURE 27.1 Noisy measurements of the output of a long FIR channel with an unknown impulse
response vector g and transfer function G(z). 


As we shall see, block adaptive filters are well suited for such scenarios. While these
filters are essentially equivalent to the LMS procedure (27.3), they nevertheless evaluate the
error sequence $\{e(i)\}$ and the estimates $\{\hat{d}(i)\}$ in a more efficient manner. They do so by
working with transformed regressors and by processing the data on a block-by-block basis.
In this way, they end up reducing the computational cost by a factor $\alpha > 1$ compared with
a plain LMS implementation. Besides computational efficiency, block adaptive filters also
exhibit better convergence performance than LMS. This is because the eigenvalue spread
of the covariance matrix of the transformed regression data is usually reduced relative to
that of the original regression data as a result of a band partitioning property; this property
will be explained in greater detail in Sec. 28.2.


FIGURE 27.2 A structure for adaptive channel estimation. 
27.2 BLOCK DATA FORMULATION


The first step toward developing efficient block adaptive filters is to explain how a long
FIR filter, such as the one shown in Fig. 27.1 and independent of the adaptive context, can
be implemented in an equivalent form that operates on blocks of data rather than on one
sample at a time. We find it convenient to motivate the block implementations by working
with the z-transform notation, while traditional derivations of block adaptive filters tend
to be carried out in the time-domain. The z-domain arguments allow one to exploit
the block structure to a great extent and to motivate families of block adaptive filters that
use other orthogonal transformations (see, e.g., Apps. 10.D and 10.E of Sayed (2003)).
The z-domain arguments also bring forth connections between block adaptive filters and
another class of filters known as subband adaptive filters (see Sec. 28.2).

Consider a long impulse response sequence $g$ and its transfer function $G(z)$, as in (27.2).
We refer to $G(z)$ as the fullband filter. Let also $\{Y(z), U(z)\}$ denote the z-transforms of
its causal output and input sequences $\{y(i), u(i)\}$,

$$Y(z) \triangleq \sum_{i=0}^{\infty} y(i) z^{-i}, \qquad U(z) \triangleq \sum_{i=0}^{\infty} u(i) z^{-i}$$

Due to causality, these transforms are assumed to exist outside some circular domains,
say, $|z| > r_y$ and $|z| > r_u$, respectively, for some positive scalars $\{r_y, r_u\}$. Then the
input/output filter relation $y(i) = u_i g$ translates into

$$Y(z) = G(z)U(z) \tag{27.4}$$

which is depicted in Fig. 27.3. In this implementation, scalar entries $u(i)$ are fed into the
channel $G(z)$ and scalar outputs $y(i)$ are obtained as a result.

Now we shall derive an alternative implementation that processes several input samples
simultaneously and generates the corresponding output samples also simultaneously. In
this implementation, data will be processed in a block manner, say, in blocks of size $B$
each. To see how this can be done, we start by defining column vectors of length $B$ as
follows:

$$u_{B,n} \triangleq \begin{bmatrix} u(nB+B-1) \\ \vdots \\ u(nB+1) \\ u(nB) \end{bmatrix}, \qquad y_{B,n} \triangleq \begin{bmatrix} y(nB+B-1) \\ \vdots \\ y(nB+1) \\ y(nB) \end{bmatrix} \tag{27.5}$$

where the integer $n = 0, 1, \ldots$ is used as a block index. For example, assuming the block
size is $B = 3$, the block vectors $\{u_{B,n}, y_{B,n}\}$ will result from partitioning the streams of
data $\{u(i), y(i)\}$ into successive blocks of size $B$ each:

$$\underbrace{u(0)\ \ u(1)\ \ u(2)}_{u_{3,0}}\ \ \underbrace{u(3)\ \ u(4)\ \ u(5)}_{u_{3,1}}\ \ \underbrace{u(6)\ \ u(7)\ \ u(8)}_{u_{3,2}}\ \ \ldots$$

$$\underbrace{y(0)\ \ y(1)\ \ y(2)}_{y_{3,0}}\ \ \underbrace{y(3)\ \ y(4)\ \ y(5)}_{y_{3,1}}\ \ \underbrace{y(6)\ \ y(7)\ \ y(8)}_{y_{3,2}}\ \ \ldots$$

FIGURE 27.3 Fullband processing whereby signals are processed on a sample-by-sample basis.

We shall assume, without loss of generality, that the filter length $M$ and the block size
$B$ are such that the ratio $M/B$ is an integer. Usually, both $M$ and $B$ are powers of 2,
e.g., $M = 1024$ and $B = 32$ or some other values. Let $\{\mathcal{U}_B(z), \mathcal{Y}_B(z)\}$ denote the
z-transforms of the so-defined block causal sequences $\{u_{B,n}, y_{B,n}\}$, i.e.,

$$\mathcal{Y}_B(z) \triangleq \sum_{n=0}^{\infty} y_{B,n} z^{-n}, \qquad \mathcal{U}_B(z) \triangleq \sum_{n=0}^{\infty} u_{B,n} z^{-n}$$

Note that we are using calligraphic letters to refer to vector or matrix functions of $z$. The
transforms $\{\mathcal{U}_B(z), \mathcal{Y}_B(z)\}$ are $B \times 1$ vector functions of $z$. Just like (27.4), there is also
a relation between $\{\mathcal{Y}_B(z), \mathcal{U}_B(z)\}$. Indeed, let

$$\{P_k(z), \ k = 0, 1, \ldots, B-1\}$$

denote the so-called polyphase components of the fullband filter $G(z)$: these are
$M/B$-long FIR filters defined as follows:

$$\begin{aligned}
P_0(z) &= g(0) + g(B)z^{-1} + g(2B)z^{-2} + \ldots \\
P_1(z) &= g(1) + g(B+1)z^{-1} + g(2B+1)z^{-2} + \ldots \\
P_2(z) &= g(2) + g(B+2)z^{-1} + g(2B+2)z^{-2} + \ldots \\
&\ \ \vdots \\
P_{B-1}(z) &= g(B-1) + g(2B-1)z^{-1} + g(3B-1)z^{-2} + \ldots
\end{aligned} \tag{27.6}$$

That is, the first $B$ coefficients of $G(z)$ are the leading coefficients of the $\{P_k(z)\}$; the next
$B$ coefficients of $G(z)$ are the second coefficients of the $\{P_k(z)\}$, and so on. For example,
assume $B = 3$ and $M = 12$. Then $G(z)$ will have 3 polyphase components that are given
by

$$\begin{aligned}
P_0(z) &= g(0) + g(3)z^{-1} + g(6)z^{-2} + g(9)z^{-3} \\
P_1(z) &= g(1) + g(4)z^{-1} + g(7)z^{-2} + g(10)z^{-3} \\
P_2(z) &= g(2) + g(5)z^{-1} + g(8)z^{-2} + g(11)z^{-3}
\end{aligned}$$

In this way, as indicated in the diagram below, every third coefficient of $G(z)$ is copied
into the relevant polyphase component:

$$\begin{array}{cccccccccc}
g(0) & g(1) & g(2) & g(3) & g(4) & g(5) & g(6) & g(7) & g(8) & \ldots \\
P_0 & P_1 & P_2 & P_0 & P_1 & P_2 & P_0 & P_1 & P_2 & \ldots
\end{array}$$
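Extracting the polyphase components of (27.6) amounts to reading every $B$-th tap of the impulse response, which a reshape does directly. The sketch below uses stand-in taps $g(k) = k$ so that each printed entry shows which index of $g$ it came from; the function name `polyphase_components` is hypothetical.

```python
import numpy as np

def polyphase_components(g, B):
    """Return a B x (M/B) array whose k-th row holds the coefficients of P_k(z)."""
    g = np.asarray(g)
    if len(g) % B != 0:
        raise ValueError("filter length M must be a multiple of the block size B")
    return g.reshape(-1, B).T        # row k: g(k), g(B + k), g(2B + k), ...

g = np.arange(12)                    # stand-in taps with g(k) = k (M = 12)
P = polyphase_components(g, 3)       # B = 3
print(P)
```

Row $k$ of the result lists the coefficients of $P_k(z)$ in increasing powers of $z^{-1}$, matching the $B = 3$, $M = 12$ example in the text.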


To proceed, we collect the $\{P_k(z)\}$ into a $B \times B$ matrix $\mathcal{G}(z)$ as follows, e.g., for $B = 3$,

$$\mathcal{G}(z) = \begin{bmatrix}
P_0(z) & P_1(z) & P_2(z) \\
z^{-1}P_2(z) & P_0(z) & P_1(z) \\
z^{-1}P_1(z) & z^{-1}P_2(z) & P_0(z)
\end{bmatrix} \tag{27.7}$$

The first row of $\mathcal{G}(z)$ contains all the polyphase components $\{P_k(z)\}$. Moreover, $\mathcal{G}(z)$
has a pseudo-circulant structure. A pseudo-circulant matrix function is essentially a
circulant matrix function with the exception that all entries below the main diagonal are further
multiplied by $z^{-1}$. Recall that a circulant matrix is a Toeplitz matrix (i.e., one with
identical entries along the diagonals) with the additional property that its first row is circularly
shifted to the right, one shift at a time, in order to form the other rows.

With the functions $\{\mathcal{Y}_B(z), \mathcal{U}_B(z), \mathcal{G}(z)\}$ defined in this manner, some
straightforward algebra will show that expression (27.4) leads to the following block relation (recall
Prob. II.33):

$$\mathcal{Y}_B(z) = \mathcal{G}(z)\,\mathcal{U}_B(z) \tag{27.8}$$

The result is depicted in Fig. 27.4. In this implementation, block entries $u_{B,n}$ are fed
into the matrix filter $\mathcal{G}(z)$ and block outputs $y_{B,n}$ are obtained as a result. Comparing
the implementations of Figs. 27.3 and 27.4, we see that the former relies on the transfer
function $G(z)$ while the latter relies on the matrix transfer function $\mathcal{G}(z)$.

FIGURE 27.4 Block processing whereby signals are processed on a block-by-block basis.
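The block relation (27.8) can be tested by filtering the $B$ input streams through the entries of the pseudo-circulant matrix (27.7) and comparing with an ordinary convolution. The sketch below is written for general $B$ under the block ordering of (27.5), with the newest sample at the top of each block; the function name `block_filter` and the random test data are assumptions made for illustration.

```python
import numpy as np

def block_filter(g, u, B):
    """Filter u through G(z) block by block using (27.7)-(27.8)."""
    P = np.asarray(g, dtype=float).reshape(-1, B).T      # polyphase components, B x (M/B)
    n_blocks = len(u) // B
    blocks = np.asarray(u[:n_blocks * B], dtype=float).reshape(n_blocks, B)
    # Row c of U holds the stream u(nB + B - 1 - c), n = 0, 1, ... (ordering of (27.5)).
    U = blocks[:, ::-1].T
    Y = np.zeros((B, n_blocks))
    for r in range(B):
        for c in range(B):
            if c >= r:
                h = P[c - r]                                  # entry P_{c-r}(z)
            else:
                h = np.concatenate(([0.0], P[B + c - r]))     # entry z^{-1} P_{B+c-r}(z)
            Y[r] += np.convolve(U[c], h)[:n_blocks]
    # Undo the block ordering: row r of Y holds y(nB + B - 1 - r).
    return Y.T[:, ::-1].reshape(-1)

rng = np.random.default_rng(4)
g = rng.standard_normal(12)          # M = 12 taps
u = rng.standard_normal(60)          # 20 blocks of size B = 3
y_block = block_filter(g, u, 3)
y_full = np.convolve(u, g)[:60]      # sample-by-sample (fullband) filtering
print(np.max(np.abs(y_block - y_full)))
```

This direct evaluation is no cheaper than the fullband filter; it only confirms that the two structures compute the same output, which is the starting point for the efficient implementation developed next.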

Although we have arrived at a block implementation scheme for $G(z)$, this solution
is still inefficient, and more needs to be done in order to transform it into a truly efficient
block processing scheme. Specifically, note that the polyphase components $P_k(z)$ in (27.7)
are polynomials in $z^{-1}$, and each one of them has in general degree $M/B - 1$. It follows
that $\mathcal{G}(z)$ is a matrix polynomial function with highest degree equal to $M/B$. This means
that we can express $\mathcal{G}(z)$ in the form

$$\mathcal{G}(z) = G_0 + G_1 z^{-1} + G_2 z^{-2} + \ldots + G_{M/B}\, z^{-M/B} \tag{27.9}$$

for some $B \times B$ coefficients $\{G_k\}$. Then we could envision a block FIR structure for
implementing $\mathcal{G}(z)$ in a manner similar to Fig. 27.1, with block coefficients $\{G_k\}$ and block
input and output vectors $\{u_{B,n}, y_{B,n}\}$. However, such a structure would be inefficient for
two reasons. First, the $M/B + 1$ block coefficients $\{G_k\}$ in (27.9) amount to a total of
$B(M + B)$ scalar coefficients, which is essentially $B$ times larger than the $(M + 1)$
coefficients $\{g(k)\}$ we started with for the original fullband filter implementation of Fig. 27.1.
Second, the coefficients $\{G_k\}$ themselves are highly structured, as can be seen from (27.7),
and we should be able to exploit this structure to our advantage.

Our purpose is therefore to show how to devise an efficient structure for implementing
$\mathcal{G}(z)$ and, hence, for carrying out the original convolution of Fig. 27.1 in a block manner.
Once this is done, we shall then use the resulting block structure as the launching pad for
developing block adaptive implementations instead of the tapped-delay-line
implementation of Fig. 27.2. These block adaptive solutions will be efficient in that they will require
fewer computations than the LMS implementation (27.3) itself.

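As a prelude to our arguments, we start by noting that the matrix $\mathcal{G}(z)$ in (27.7) can be
factored as

$$\mathcal{G}(z) = \mathcal{P}(z)\,\mathcal{Q}(z) \tag{27.10}$$

where $\mathcal{P}(z)$ is a $B \times (2B-1)$ matrix function with Toeplitz structure, e.g., for $B = 3$,

$$\mathcal{P}(z) = \begin{bmatrix}
P_0(z) & P_1(z) & P_2(z) & 0 & 0 \\
0 & P_0(z) & P_1(z) & P_2(z) & 0 \\
0 & 0 & P_0(z) & P_1(z) & P_2(z)
\end{bmatrix}; \quad B \times (2B-1) \tag{27.11}$$

and $\mathcal{Q}(z)$ is a $(2B-1) \times B$ matrix with a leading identity block and a lower block with
unit delays, say, for $B = 3$ again,

$$\mathcal{Q}(z) = \begin{bmatrix}
1 & 0 & 0 \\
0 & 1 & 0 \\
0 & 0 & 1 \\
z^{-1} & 0 & 0 \\
0 & z^{-1} & 0
\end{bmatrix}; \quad (2B-1) \times B \tag{27.12}$$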

27.3 BLOCK CONVOLUTION 


The initial step in our argument is to show how to use the discrete Fourier transform (DFT)
in order to arrive at an efficient implementation of the block processing scheme of Fig. 27.4.
For this purpose, we first remark that since it is usually desirable to work with sequences
whose lengths are powers of 2 when dealing with the DFT, it is convenient to redefine the
matrices $\mathcal{P}(z)$ and $\mathcal{Q}(z)$ in (27.11) and (27.12) as

$$\mathcal{P}(z) = \begin{bmatrix}
P_0(z) & P_1(z) & P_2(z) & 0 & 0 & 0 \\
0 & P_0(z) & P_1(z) & P_2(z) & 0 & 0 \\
0 & 0 & P_0(z) & P_1(z) & P_2(z) & 0
\end{bmatrix}; \quad (B \times 2B) \tag{27.13}$$

and

$$\mathcal{Q}(z) = \begin{bmatrix}
1 & 0 & 0 \\
0 & 1 & 0 \\
0 & 0 & 1 \\
z^{-1} & 0 & 0 \\
0 & z^{-1} & 0 \\
0 & 0 & z^{-1}
\end{bmatrix}; \quad (2B \times B) \tag{27.14}$$

with an additional zero column added to $\mathcal{P}(z)$ and an additional row added to $\mathcal{Q}(z)$. Of
course, the product $\mathcal{P}(z)\mathcal{Q}(z)$ remains equal to $\mathcal{G}(z)$. Now, however, $\mathcal{P}(z)$ is $B \times 2B$ and
$\mathcal{Q}(z)$ is $2B \times B$, and the factor $2B$ will be a power of 2 whenever $B$ is, which is generally
the case.


Transfer Function Formulation 


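With the matrices $\{\mathcal{P}(z), \mathcal{Q}(z)\}$ so defined, we embed $\mathcal{P}(z)$ into a $2B \times 2B$ circulant
matrix $\mathcal{C}(z)$, say, for $B = 3$,

$$\mathcal{C}(z) = \begin{bmatrix} P_0(z) & P_1(z) & P_2(z) & 0 & 0 & 0 \\ 0 & P_0(z) & P_1(z) & P_2(z) & 0 & 0 \\ 0 & 0 & P_0(z) & P_1(z) & P_2(z) & 0 \\ 0 & 0 & 0 & P_0(z) & P_1(z) & P_2(z) \\ P_2(z) & 0 & 0 & 0 & P_0(z) & P_1(z) \\ P_1(z) & P_2(z) & 0 & 0 & 0 & P_0(z) \end{bmatrix} \tag{27.15}$$

Note that $\mathcal{P}(z)$ can be recovered from the top $B$ rows of $\mathcal{C}(z)$ via

$$\mathcal{P}(z) = [\,I_B \;\; 0_{B \times B}\,]\,\mathcal{C}(z) \tag{27.16}$$

where $I_B$ is the $B \times B$ identity matrix and $0_{B \times B}$ is the $B \times B$ null matrix. The main
reason for embedding $\mathcal{P}(z)$ into $\mathcal{C}(z)$ is the following result. Let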


$$[F]_{km} \triangleq e^{-\frac{j2\pi mk}{2B}}, \quad k, m = 0, 1, \ldots, 2B-1 \tag{27.17}$$


denote the entries of the DFT matrix of size $2B \times 2B$. In contrast to (26.12), from now
on we shall use $F$ to refer to the standard DFT matrix without the scaling factor $1/\sqrt{M}$;
the scaling was used earlier in (26.12) while studying transform-domain adaptive filters in
order to enforce a unitary $F$. Now it is a well-known result that any circulant matrix, such
as $\mathcal{C}(z)$ above, can be diagonalized by the DFT matrix (see Prob. VI.6). More specifically,
it holds that

$$\mathcal{C}(z) = F^*\mathcal{L}(z)F \tag{27.18}$$

for some $2B \times 2B$ diagonal matrix function $\mathcal{L}(z)$ with entries

$$\mathcal{L}(z) = \mathrm{diag}\{L_0(z), L_1(z), \ldots, L_{2B-1}(z)\} \tag{27.19}$$

Each $L_k(z)$ is an FIR transfer function with $M/B$ coefficients (see Prob. VI.8). Using the
fact that $F = F^{\mathsf{T}}$, it is easy to verify by transposing (27.18) that the entries of the first row
of $\mathcal{C}(z)$ are related to the diagonal entries of $\mathcal{L}(z)$ via, e.g., for $B = 3$ (see Prob. VI.9):

$$\begin{bmatrix} P_0(z) \\ P_1(z) \\ P_2(z) \\ 0 \\ 0 \\ 0 \end{bmatrix} = F \begin{bmatrix} L_0(z) \\ L_1(z) \\ L_2(z) \\ L_3(z) \\ L_4(z) \\ L_5(z) \end{bmatrix} \tag{27.20}$$


This important relation tells us how to map the polyphase components $\{P_k(z)\}$ into the
diagonal components $\{L_k(z)\}$ and vice versa. It should be clear, though, that although any
diagonal matrix $\mathcal{L}(z)$ in (27.18) will always result in a circulant matrix $\mathcal{C}(z)$, it does not
hold that any such $\mathcal{L}(z)$ will result in a circulant matrix $\mathcal{C}(z)$ that has the special form
(27.15). This is because the transformation (27.20) requires the $\{L_k(z)\}$ to be such that
the last $B$ entries of the transformed vector are zero.
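The diagonalization (27.18) and the mapping (27.20) can be checked numerically. The following is a minimal sketch, assuming constant (memoryless) stand-ins for the polyphase components; the numerical values in `p` and `l_arbitrary` are illustrative assumptions and do not come from the text.

```python
import numpy as np

B = 3
N = 2 * B
# Unscaled DFT matrix with entries exp(-j*2*pi*k*m/(2B)), as in (27.17)
F = np.fft.fft(np.eye(N))

# Illustrative constants standing in for P_0, P_1, P_2, followed by B zeros
p = np.array([1.0, -2.0, 0.5, 0.0, 0.0, 0.0])

# Circulant matrix whose first row is p: C[a, b] = p[(b - a) mod 2B], cf. (27.15)
C = np.array([[p[(b - a) % N] for b in range(N)] for a in range(N)])

# Solve (27.20), p = F l, for the diagonal entries using F^* F = 2B I
l = F.conj().T @ p / N

# Rebuild the circulant matrix as F^* L F, cf. (27.18)
C_rebuilt = F.conj().T @ np.diag(l) @ F

# An arbitrary diagonal still gives a circulant matrix, but its first row
# generally does not end with B zeros, so it lacks the special form (27.15)
l_arbitrary = np.arange(1, N + 1, dtype=complex)
first_row_arbitrary = F @ l_arbitrary
```

The last two lines illustrate the remark above: only diagonal entries whose transform ends in $B$ zeros correspond to a matrix of the form (27.15).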

Combining (27.16) and (27.18) we can express the factorization $\mathcal{G}(z) = \mathcal{P}(z)\,\mathcal{Q}(z)$ as

$$\mathcal{G}(z) = \underbrace{[\,I_B \;\; 0_{B \times B}\,]\,\mathcal{C}(z)}_{\mathcal{P}(z)}\,\mathcal{Q}(z) = [\,I_B \;\; 0_{B \times B}\,]\,\underbrace{F^*\mathcal{L}(z)F}_{\mathcal{C}(z)}\,\mathcal{Q}(z) \tag{27.21}$$

This result shows that the transfer matrix function $\mathcal{G}(z)$ in Fig. 27.4 can be obtained as the
top $B$ rows of the transfer matrix function $F^*\mathcal{L}(z)F\mathcal{Q}(z)$. In other words, the mapping
from $u_{B,n}$ to $y_{B,n}$ in Fig. 27.4 can be alternatively implemented as shown in Fig. 27.5
in terms of $\{F, L_k(z), \mathcal{Q}(z)\}$. We shall refer to the $\{L_k(z)\}$ as the subband filters. The
reason for the terminology will become clear later in Sec. 28.2, where we explain that block
adaptive filters essentially partition the signal bandwidth into several frequency subbands
and then process the signals within these subbands.


FIGURE 27.5 Equivalent implementation of the mapping in Fig. 27.4 for block convolution in
terms of the DFT matrix. The last $B$ outputs of the transformation by $F^*$ are discarded.


Time-Domain Formulation 


In the time domain, the implementation of Fig. 27.5 amounts to performing the following
operations. First, applying the $2B \times B$ matrix function $\mathcal{Q}(z)$ to $u_{B,n}$ results in the output
vector

$$\begin{bmatrix} u_{B,n} \\ u_{B,n-1} \end{bmatrix}$$

which, for example, for block size $B = 3$ and at $n = 2$, has the form

$$\begin{bmatrix} u_{B,n} \\ u_{B,n-1} \end{bmatrix} = \mathrm{col}\{u(8), u(7), u(6), u(5), u(4), u(3)\} \tag{27.22}$$

For this reason, we shall say that the effect of $\mathcal{Q}(z)$ is to perform a serial-to-parallel (S/P)
conversion of the data. Another way to implement (or describe) this S/P conversion process
is as follows. Let $\downarrow B$ denote a decimator of order $B$, represented by the block

u(i) --> [ ↓B ] --> z(n)

where the output $z(n)$ is related to the input $u(i)$ as follows:

$$z(n) \triangleq u(nB), \quad n = 0, 1, 2, \ldots$$


In other words, the sequence at the output of the decimator is obtained by keeping the
samples of the input sequence that occur at multiples of $B$ and ignoring the other samples.
In this way, the output of the decimator is a lower-rate representation of its input sequence;
the original rate is scaled down by a factor of $B$. Observe that we are using the time
index $n$ for the lower-rate signal and the time index $i$ for the higher-rate signal. Using
the decimation-block representation, we can construct the block data vector (27.22) from
the input sequence $\{u(i)\}$ as shown in Fig. 27.6; after every $B$ input samples, a vector
$\mathrm{col}\{u_{B,n}, u_{B,n-1}\}$ is formed as in (27.22).

Once this is done, the vector (27.22) is processed by the DFT matrix $F$, resulting in a
$2B \times 1$ transformed vector, whose entries we denote by

$$u'_{2B,n} \triangleq \begin{bmatrix} u'_0(n) \\ u'_1(n) \\ \vdots \\ u'_{2B-1}(n) \end{bmatrix} \triangleq F \begin{bmatrix} u_{B,n} \\ u_{B,n-1} \end{bmatrix}$$

i.e., we shall use primes from now on to denote transformed data. The entries $\{u'_k(n)\}$
are then fed into the subband filters $\{L_k(z)\}$ and the resulting outputs are processed by
$F^*$. The top $B$ outputs of $F^*$ are the desired output vector $y_{B,n}$.

In a manner similar to constructing $u_{B,n}$ from the $\{u(i)\}$ by means of decimators, we
can conversely recover the output $\{y(i)\}$ from $y_{B,n}$ by means of interpolators or a
parallel-to-serial (P/S) conversion. While the description that follows is not necessary for our future
discussions, it is nevertheless a useful precursor to our treatment of subband adaptive filters
later in Sec. 28.2.

Let $\uparrow B$ denote an interpolator of order $B$, represented by the block

z(n) --> [ ↑B ] --> y(i)

where the output $y(i)$ is related to the input $z(n)$ as follows:

$$y(i) = \begin{cases} z(i/B), & \text{if } i/B \text{ is an integer} \\ 0, & \text{otherwise} \end{cases}$$

In other words, the sequence at the output of the interpolator is obtained by adding $(B-1)$
zeros between the samples of the input sequence. In this way, the output of the interpolator
is at the same original rate as $\{u(i)\}$.


FIGURE 27.6 Formation of the block data vector $\mathrm{col}\{u_{B,n}, u_{B,n-1}\}$ by means of a delay
line with $2B$ decimators. This decimation-based structure is equivalent to processing by $\mathcal{Q}(z)$ in
Fig. 27.5.


Using the interpolator-block representation, we can recover the $\{y(i)\}$ from $\{y_{B,n}\}$ as
shown in Fig. 27.7; the structure in the figure performs parallel-to-serial data conversion.
In order to understand how the structure functions, assume $B = 3$ so that for the first three
block iterations:

$$y_{B,2} = \begin{bmatrix} y(8) \\ y(7) \\ y(6) \end{bmatrix}, \qquad y_{B,1} = \begin{bmatrix} y(5) \\ y(4) \\ y(3) \end{bmatrix}, \qquad y_{B,0} = \begin{bmatrix} y(2) \\ y(1) \\ y(0) \end{bmatrix}$$

Clearly, the block at $n = 2$ becomes available at time instant $i = 8$. The signals at the
output of each of the interpolators are then

y(8) y(5) y(2) --> [ ↑3 ] --> y(8) 0 0 y(5) 0 0 y(2)

y(7) y(4) y(1) --> [ ↑3 ] --> y(7) 0 0 y(4) 0 0 y(1)

y(6) y(3) y(0) --> [ ↑3 ] --> y(6) 0 0 y(3) 0 0 y(0)

Therefore, when $n = 2$, which corresponds to time instant $i = 8$, the first signal appearing
at the output of Fig. 27.7 would be $y(6) = y(i - B + 1)$, followed by $y(7)$ and $y(8)$. We
thus see that a delay of $(B - 1)$ samples is introduced in the signal path.
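The serial-to-parallel and parallel-to-serial conversions described above can be sketched with array reshaping. This is only an illustration; the helper names `decimate`, `interpolate`, `serial_to_parallel`, and `parallel_to_serial` are hypothetical, and the sketch assumes $u_{B,-1} = 0$ and works on whole blocks rather than sample by sample (so the $B - 1$ sample delay does not appear explicitly).

```python
import numpy as np

def decimate(u, B):
    """Decimator of order B: z(n) = u(nB)."""
    return u[::B]

def interpolate(z, B):
    """Interpolator of order B: insert B-1 zeros after each input sample."""
    y = np.zeros(len(z) * B, dtype=z.dtype)
    y[::B] = z
    return y

def serial_to_parallel(u, B):
    """Row n of the result is col{u_{B,n}, u_{B,n-1}} with the newest sample first."""
    nblocks = len(u) // B
    blocks = u[:nblocks * B].reshape(nblocks, B)[:, ::-1]   # row n is u_{B,n}
    # Previous block for each n, with u_{B,-1} = 0 (an assumption of this sketch)
    prev = np.vstack([np.zeros((1, B), dtype=blocks.dtype), blocks[:-1]])
    return np.hstack([blocks, prev])

def parallel_to_serial(yB):
    """Row n holds col{y(nB+B-1), ..., y(nB)}; return the samples in time order."""
    return yB[:, ::-1].reshape(-1)

u = np.arange(9)                 # u(i) = i, so the entries are easy to read
U = serial_to_parallel(u, 3)     # the row for n = 2 reproduces the example in (27.22)
recovered = parallel_to_serial(U[:, :3])
```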

With these remarks we conclude that the structure of Fig. 27.5 amounts in the time
domain to the operations shown in Fig. 27.8, with the outputs of the $\{L_k(z)\}$ filters denoted
by $\{y'_k(n)\}$.

We therefore find that the fullband FIR implementation of Fig. 27.1 can be equivalently
implemented as a bank of $2B$ FIR filters, $\{L_k(z)\}$, of length $M/B$ each and which operate
at $1/B$ the original data rate. In addition, the input signals to these shorter filters are
obtained after processing a block of $2B$ input data, shown in (27.22), by the DFT matrix.
Observe that while the implementation of Fig. 27.1 relies on the $M$ coefficients of $G(z)$,
the one in Fig. 27.8 relies on a total of $2B \times (M/B) = 2M$ coefficients of the filters
$\{L_k(z)\}$. So it may seem, at first sight, that any subsequent adaptive implementation that
estimates the $2M$ coefficients of $\{L_k(z)\}$ would be costlier than the one that estimates the
$M$ coefficients of $G(z)$ (as in Fig. 27.2). However, as we shall see, this is not the case
because the filters $L_k(z)$ operate at a sampling rate that is $B$ times slower than $G(z)$ (and,
therefore, their coefficients will only need to be adapted at this slower rate). This difference
in the rate of operation leads to significant savings in computation.


FIGURE 27.7 Reconstruction of the sequence $\{y(i)\}$ from the entries of $y_{B,n}$ by means of $B$
interpolators. Observe that the output sequence is $y(i - B + 1)$, with a delay of $B - 1$ samples.


FIGURE 27.8 Time-domain equivalent of the block implementation of Fig. 27.5, using $2B$
decimators at the input and $B$ interpolators at the output. The last $B$ outputs of the second
$2B \times 2B$ transformation are discarded. A delay of $(B - 1)$ samples is introduced
in the signal path.


In summary, so far, we have described two ways for implementing a transfer function
$G(z)$. The first one is the fullband implementation of Fig. 27.3 whereby the output of
$G(z)$ is evaluated one sample at a time by convolving the sequence $\{u(i)\}$ with the impulse
response sequence of $G(z)$, i.e., via the inner product $y(i) = u_i g$, where $u_i$ is the
$M$-dimensional regressor

$$u_i = [\,u(i) \;\; u(i-1) \;\; \ldots \;\; u(i-M+1)\,]$$

The inconvenience of this procedure is that it requires convolution with the long filter $G(z)$.

The second implementation is the one shown in Fig. 27.8, which relies on a bank of
$2B$ filters $\{L_k(z)\}$ applied to the transformed data $\{u'_k(n)\}$. This structure operates in a
block manner, with the processing by the input DFT transformation performed after each
block of $B$ input data is collected, as indicated by Fig. 27.6. This implementation is more
efficient than that of Fig. 27.1, as we are going to show in the next subsection. Nevertheless,
it suffers from a delay problem since it introduces a delay of $(B - 1)$ samples in the
signal path. In Prob. VI.14 we show how to modify the frequency-domain implementation
of Fig. 27.8 in order to remove the delay problem. Basically, a direct convolution path of
length $B$ is added to Fig. 27.8.


Computational Complexity 

Let us now compare the computational complexity of the implementations of Figs. 27.1
and 27.8. To do so, we shall only focus on the required number of multiplications per input
sample. We shall also assume that a DFT of size $K$ requires approximately $\frac{K}{2}\log_2(K)$
complex multiplications when evaluated by means of the fast Fourier transform (FFT).


In the direct convolution implementation of Fig. 27.1 we need to compute an inner
product of order $M$, which translates into $M$ complex multiplications per input sample.
In the DFT-based implementation of Fig. 27.8, on the other hand, the following steps are
necessary for each block of data of size $B$:


1. A transform of size $2B$ to compute the product $F \cdot \mathrm{col}\{u_{B,n}, u_{B,n-1}\}$. This step
requires $B\log_2(2B)$ complex multiplications.


2. Filtering by $2B$ filters $\{L_k(z)\}$ of order $M/B$ each. Each filtering operation requires
an inner product of size $M/B$ and, therefore, $M/B$ complex multiplications. In
total, this step requires $2M$ complex multiplications.


3. A second transform of size $2B$ to generate the output signals. This step requires
$B\log_2(2B)$ complex multiplications.


Steps 1-3 add up to $2M + 2B\log_2(2B)$ complex multiplications for each block of input
data of size $B$.

In addition, the filters $\{L_k(z)\}$ themselves need to be determined from $G(z)$. From
(27.20) we see that the coefficients of $\{L_k(z)\}$ are found by applying $F^*$ to a matrix with
the polyphase components of $G(z)$. Specifically, using

$$F^*F = 2B\, I_{2B}$$

we have that

$$\begin{bmatrix} L_0(z) \\ L_1(z) \\ \vdots \\ L_{2B-1}(z) \end{bmatrix} = \frac{1}{2B}\,F^* \begin{bmatrix} P_0(z) \\ \vdots \\ P_{B-1}(z) \\ 0_{B \times 1} \end{bmatrix} \tag{27.23}$$

e.g., for $B = 3$ and $M = 12$,

$$\underbrace{\begin{bmatrix} l_{00} & l_{01} & l_{02} & l_{03} \\ l_{10} & l_{11} & l_{12} & l_{13} \\ l_{20} & l_{21} & l_{22} & l_{23} \\ l_{30} & l_{31} & l_{32} & l_{33} \\ l_{40} & l_{41} & l_{42} & l_{43} \\ l_{50} & l_{51} & l_{52} & l_{53} \end{bmatrix}}_{2B \times M/B} = \frac{1}{2B}\,F^* \underbrace{\begin{bmatrix} g(0) & g(3) & g(6) & g(9) \\ g(1) & g(4) & g(7) & g(10) \\ g(2) & g(5) & g(8) & g(11) \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}}_{2B \times M/B} \tag{27.24}$$

This step therefore requires computing $M/B$ DFTs of size $2B$ each, which amounts to
a cost of $M\log_2(2B)$ complex multiplications. If we normalize by the size of the block,
we get a cost on the order of $\frac{M}{B}\log_2(2B)$ complex multiplications per input sample. For
a fixed $G(z)$, this cost corresponds to a one-time overhead since the computation of the
$\{L_k(z)\}$ is done once and used thereafter for all input samples. However, as we shall see,
for adaptive implementations, the calculation of the $\{L_k(z)\}$ needs to be repeated once for
every block of data.

We therefore find that the cost associated with the implementation of Fig. 27.8 is
approximately

$$\frac{2M}{B} + \left(\frac{M}{B} + 2\right)\log_2(2B) \quad \text{complex multiplications per input sample}$$

The main conclusion is that while the direct convolution method of Fig. 27.1 requires
$O(M)$ operations per sample, the frequency-domain implementation of Fig. 27.8 requires
$O(2M/B)$ operations per sample; the reduction in complexity is determined by the block
size $B$. However, larger values for $B$ result in longer delays in the signal path.
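The per-sample counts derived in this subsection can be tabulated with a few lines of code. This is a minimal sketch under the same counting assumption used in steps 1 and 3 (a transform of size $2B$ costs $B\log_2(2B)$ complex multiplications); the function names are hypothetical, and other FFT cost conventions would change the constants.

```python
import math

def direct_cost(M):
    """Complex multiplications per input sample for direct convolution."""
    return M

def block_cost(M, B):
    """Per-sample cost of the DFT-based scheme: steps 1-3 plus computing the subband filters."""
    filtering = 2 * M / B                      # 2M multiplications per block of B samples
    transforms = 2 * math.log2(2 * B)          # two size-2B transforms per block
    filter_design = (M / B) * math.log2(2 * B) # M/B transforms of size 2B per block
    return filtering + transforms + filter_design

M, B = 1024, 16
ratio = block_cost(M, B) / direct_cost(M)      # relative cost of the block scheme
```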


Algorithm 27.1 (Block convolution via DFT) Consider an input/output mapping
as shown in Fig. 27.1 for some transfer function $G(z)$ with $M$ taps. The
direct convolution operation can be implemented more efficiently as shown in
Figs. 27.5 and 27.8. Specifically, choose a block size $B$; usually $M$ and $B$ are
powers of 2 and $M/B$ is an integer.


Bank of filters. Determine the polyphase components of $G(z)$ of order $M/B$,
as defined by (27.6), and use (27.23)-(27.24) to determine the $2B$ filters
$\{L_k(z)\}$ of size $M/B$ each.


Block filtering. Start with $u_{B,-1} = 0$ and repeat for $n \geq 0$:

1. Construct the block vectors:

$$u_{B,n} = \mathrm{col}\{u(nB+B-1), \ldots, u(nB)\}, \qquad u_{2B,n} = \mathrm{col}\{u_{B,n}, u_{B,n-1}\}$$

2. Perform the DFT transformation:

$$u'_{2B,n} = F u_{2B,n} \triangleq \mathrm{col}\{u'_k(n),\; k = 0, 1, \ldots, 2B-1\}$$

and filter the entries $\{u'_k(n)\}$ through the subband filters $\{L_k(z)\}$. Let
$y'_{2B,n} = \mathrm{col}\{y'_k(n),\; k = 0, 1, \ldots, 2B-1\}$ be a vector of size $2B \times 1$
that collects the outputs of these filters at iteration $n$.

3. Perform the DFT transformation:

$$\begin{bmatrix} y_{B,n} \\ \times \end{bmatrix} = F^* y'_{2B,n}$$

where $\times$ denotes $B$ entries to be ignored, and
$y_{B,n} = \mathrm{col}\{y(nB+B-1), \ldots, y(nB+1), y(nB)\}$.

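The steps of Alg. 27.1 can be sketched in NumPy and compared against direct convolution. This is an illustrative sketch, not a tuned implementation; the function name `block_convolve_dft` and the random test data are assumptions. It uses the facts that `np.fft.fft` applies the unscaled DFT matrix $F$ and that `np.fft.ifft` applies $F^*/2B$ for length-$2B$ vectors.

```python
import numpy as np

def block_convolve_dft(g, u, B):
    """Overlap-save block convolution following the steps of Alg. 27.1 (sketch)."""
    g = np.asarray(g)
    u = np.asarray(u)
    M = len(g)
    if M % B != 0:
        raise ValueError("M/B must be an integer")
    K = M // B                                   # length of each subband filter
    N = 2 * B                                    # DFT size

    # Polyphase coefficients: row k holds g(k), g(k+B), g(k+2B), ...
    P = g.reshape(K, B).T
    P_padded = np.vstack([P, np.zeros((B, K))])
    L = np.fft.ifft(P_padded, axis=0)            # (1/2B) F^* [P; 0], cf. (27.23)-(27.24)

    nblocks = len(u) // B
    y = np.zeros(nblocks * B, dtype=complex)
    history = np.zeros((N, K), dtype=complex)    # column m holds u'_{2B,n-m}
    prev = np.zeros(B, dtype=u.dtype)            # u_{B,-1} = 0

    for n in range(nblocks):
        cur = u[n * B:(n + 1) * B][::-1]         # u_{B,n}, newest sample first
        u_prime = np.fft.fft(np.concatenate([cur, prev]))   # F u_{2B,n}
        history = np.roll(history, 1, axis=1)
        history[:, 0] = u_prime
        y_prime = np.sum(L * history, axis=1)    # outputs of the 2B subband filters
        out = N * np.fft.ifft(y_prime)           # F^* y'_{2B,n}
        y[n * B:(n + 1) * B] = out[:B][::-1]     # keep top B entries, in time order
        prev = cur
    return y

rng = np.random.default_rng(0)
g = rng.standard_normal(8)                       # illustrative 8-tap filter
u = rng.standard_normal(40)                      # illustrative input
y_block = block_convolve_dft(g, u, B=4)
y_direct = np.convolve(u, g)[:len(u)]
max_err = np.max(np.abs(y_block - y_direct))     # difference from direct convolution
```

The sketch processes whole blocks at once, so the $(B - 1)$ sample delay of the sample-by-sample structure in Fig. 27.8 shows up only as the need to wait for a full block before producing outputs.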


Example 27.1 (Cost comparison)


In order to compare the computational costs of the direct convolution method of Fig. 27.3 and the
frequency-domain implementation of Fig. 27.8, we plot in Fig. 27.9 two curves that show how the
ratio

$$\frac{\text{frequency-domain cost}}{\text{direct convolution cost}} = \frac{\frac{2M}{B} + \left(\frac{M}{B} + 2\right)\log_2(2B)}{M} \tag{27.25}$$

varies as a function of $M$ and $B$. In the top plot of the figure we fix the fullband length at $M = 1024$
taps and vary the block size $B$ in powers of 2 between 2 and 1024. We see that the ratio drops
below one (and, hence, the frequency-domain implementation becomes more efficient) for values of
$B$ larger than or equal to 16. In the bottom plot we fix the block length at $B = 32$ and vary the
fullband filter length $M$ in powers of 2 between 32 and 1024. We see that in this case, the ratio drops
below one for $M$ larger than or equal to 64.


[Figure 27.9. Top panel: frequency-domain cost relative to fullband implementation as a function of the block length $B$ (2 to 1024). Bottom panel: frequency-domain cost relative to fullband implementation as a function of the filter order $M$ (32 to 1024), for $B = 32$.]

FIGURE 27.9 Plots of the ratio (27.25) for fixed $M$ and varying $B$ (top curve) and fixed $B$
and varying $M$ (bottom curve). The ratio compares the cost of the fullband and frequency-domain
implementations of Figs. 27.3 and 27.8. When the ratio drops below one, the frequency-domain
implementation becomes more efficient than the fullband implementation.


Remark 27.1 (Overlap-save and overlap-add structures) The implementation of Fig. 27.8
and Alg. 27.1 is referred to as an overlap-save block-convolution procedure. The term overlap-save
is used to indicate that successive input blocks $\{u_{2B,n}\}$ are overlapped prior to the DFT operation
and only half of each block is saved. Observe, for example, that

$$u_{2B,n} = \begin{bmatrix} u_{B,n} \\ u_{B,n-1} \end{bmatrix}, \qquad u_{2B,n+1} = \begin{bmatrix} u_{B,n+1} \\ u_{B,n} \end{bmatrix}$$

which shows that $u_{2B,n}$ and $u_{2B,n+1}$ share a common sub-block. There is an alternative
overlap-add procedure for block convolution. In this case, the successive input blocks $\{u_{2B,n}\}$ do not share
a common sub-block. This structure, and the corresponding block adaptive filters, are discussed in
App. 28.B.


Block and Subband Adaptive Filters


Now that we know how to implement a long FIR filter (i.e., a long fullband convolution)
efficiently by means of block processing, we can proceed to show how this result can be
used to develop efficient block adaptive filters.


28.1 DFT BLOCK ADAPTIVE FILTERS 


Thus, refer to the block implementation of Fig. 27.8. In a manner similar to (27.5), we
define the vectors

$$d_{B,n} \triangleq \begin{bmatrix} d(nB+B-1) \\ \vdots \\ d(nB+1) \\ d(nB) \end{bmatrix}, \qquad v_{B,n} \triangleq \begin{bmatrix} v(nB+B-1) \\ \vdots \\ v(nB+1) \\ v(nB) \end{bmatrix}$$

We further let $l_k$ denote the $M/B \times 1$ weight vector associated with $L_k(z)$, and introduce
the corresponding $1 \times M/B$ regression vector

$$u'_{k,n} = \left[\; u'_k(n) \;\; u'_k(n-1) \;\; \ldots \;\; u'_k\!\left(n - \tfrac{M}{B} + 1\right) \;\right]$$

Then the output of each $L_k(z)$ is given by the inner product $u'_{k,n} l_k$. With this notation,
we can express $d_{B,n}$ in terms of the subband filter coefficients $\{l_k\}$ as follows. Collect the
regressors $\{u'_{k,n}\}$ into a block diagonal matrix $U'_n$:

$$U'_n \triangleq \begin{bmatrix} u'_{0,n} & & & \\ & u'_{1,n} & & \\ & & \ddots & \\ & & & u'_{2B-1,n} \end{bmatrix}, \quad (2B \times 2M) \tag{28.1}$$

and the corresponding filter coefficients into a $2M \times 1$ vector

$$l \triangleq \begin{bmatrix} l_0 \\ l_1 \\ \vdots \\ l_{2B-1} \end{bmatrix}, \quad (2M \times 1) \tag{28.2}$$

Then the vector $y'_{2B,n}$ at the output of the bank of filters $\{L_k(z)\}$ in Fig. 27.8 is equal to
the product $U'_n l$, so that the output $y_{B,n}$ is given by

$$y_{B,n} = [\,I_B \;\; 0_{B \times B}\,]\,F^* U'_n l$$

and, consequently, since $d_{B,n} = y_{B,n} + v_{B,n}$, we have

$$d_{B,n} = [\,I_B \;\; 0_{B \times B}\,]\,F^* U'_n l + v_{B,n} \tag{28.3}$$


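We can rewrite this relation in terms of stochastic data as

$$\boldsymbol{d}_{B,n} = [\,I_B \;\; 0_{B \times B}\,]\,F^* \boldsymbol{U}'_n l + \boldsymbol{v}_{B,n} \tag{28.4}$$

with the boldface symbols $\{\boldsymbol{d}_{B,n}, \boldsymbol{U}'_n, \boldsymbol{v}_{B,n}\}$ denoting random variables whose
realizations are $\{d_{B,n}, U'_n, v_{B,n}\}$. We can now motivate an adaptive implementation for
estimating the channel $g$ (or, correspondingly, its subband filters $\{l_k\}$) by posing the problem of
estimating $l$ from $\boldsymbol{d}_{B,n}$ in (28.4) in the linear least-mean-squares sense, namely, by solving
a problem of the form

$$\min_{l}\; E\,\big\| \boldsymbol{d}_{B,n} - [\,I_B \;\; 0_{B \times B}\,]\,F^* \boldsymbol{U}'_n l \big\|^2 \tag{28.5}$$

where $l$ is $2M \times 1$. In this problem, the matrix $[\,I_B \;\; 0_{B \times B}\,]\,F^* \boldsymbol{U}'_n$ plays the role of the
regression data. An LMS procedure for estimating $l$ would then operate on the realizations
$\{d_{B,n}, U'_n\}$ and take the form

$$\hat{d}_{B,n} = [\,I_B \;\; 0_{B \times B}\,]\,F^* U'_n l_{n-1} \tag{28.6}$$

$$e_{B,n} = d_{B,n} - \hat{d}_{B,n} \tag{28.7}$$

$$l_n = l_{n-1} + \mu\, U_n^{\prime *} F \begin{bmatrix} I_B \\ 0_{B \times B} \end{bmatrix} e_{B,n} \tag{28.8}$$

Equations (28.6)-(28.8) are equivalent to

$$l_n = l_{n-1} + \mu\, U_n^{\prime *} e'_{2B,n}, \qquad l_{-1} = 0 \tag{28.9}$$

where we are defining the $2B \times 1$ transformed error vector

$$e'_{2B,n} \triangleq F \begin{bmatrix} I_B \\ 0_{B \times B} \end{bmatrix} e_{B,n}, \quad (2B \times 1) \tag{28.10}$$

From the $2M \times 1$ vector $l_n$ we can recover estimates for the coefficients of $\{L_k(z)\}$;
they are simply stacked on top of each other in $l_n$. Specifically, from (28.9), the update
for the estimate of the $k$-th weight vector $l_k$ is given by (in terms of the $k$-th entry of
$e'_{2B,n}$ and the $k$-th regression vector $u'_{k,n}$):

$$l_{k,n} = l_{k,n-1} + \mu\, u_{k,n}^{\prime *}\, e'_k(n), \qquad l_{k,-1} = 0, \quad k = 0, 1, \ldots, 2B-1 \tag{28.11}$$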


Unconstrained Filter Implementation 


Usually, an NLMS-type update with power normalization is employed instead of LMS so
that (28.11) is replaced by

$$l_{k,n} = l_{k,n-1} + \frac{\mu}{\lambda_k(n)}\, u_{k,n}^{\prime *}\, e'_k(n), \qquad l_{k,-1} = 0, \quad k = 0, 1, \ldots, 2B-1 \tag{28.12}$$

with each $\lambda_k(n)$ evaluated via

$$\lambda_k(n) = \beta \lambda_k(n-1) + (1 - \beta)\,|u'_k(n)|^2, \qquad 0 \ll \beta < 1$$

with initial condition $\lambda_k(-1) = \epsilon$ (a small positive number) and, e.g., $\beta = 0.9$.
The use of power normalization helps improve the convergence performance of
the algorithm. This is because, as we are going to explain later in Sec. 28.2, the spectrum
of the sequence $\{u'_k(n)\}$ at the input of each subband filter is approximately flat. Then the
$\{\lambda_k(n)\}$ provide estimates for the input powers across the subband filters, and normalization
by them helps normalize the signal powers across the subbands.


In summary, we arrive at the statement below. Figure 28.1 shows a block diagram
representation of the algorithm. This implementation introduces a delay of $B - 1$ samples
in the evaluation of the sequences $\{\hat{d}(i), e(i)\}$ due to the parallelization of $\{d(i), u(i)\}$, as
explained in Fig. 27.8. Later, in App. 28.A, we comment on how Fig. 28.1 can be modified
to ameliorate this delay problem (see Figs. 28.9 and 28.10).


Algorithm 28.1 (Unconstrained DFT block filter) Consider the adaptive channel
estimation scheme shown in Fig. 27.2, where $w_{i-1}$ is $M \times 1$ and the
regressor $u_i$ is $1 \times M$. For relatively long channels (i.e., large $M$), a more
efficient adaptive procedure for generating the estimates $\{\hat{d}(i)\}$, and the
corresponding error sequence $\{e(i)\}$, is the following. Choose a block size $B$;
usually $M$ and $B$ are powers of 2 and $M/B$ is an integer. Select $0 \ll \beta < 1$,
set $l_{k,-1} = 0$, $\lambda_k(-1) = \epsilon$ (a small positive number), and repeat for $n \geq 0$:

$$\begin{aligned}
u_{B,n} &= \mathrm{col}\{u(nB+B-1), \ldots, u(nB+1), u(nB)\} \\
u'_{2B,n} &= F \begin{bmatrix} u_{B,n} \\ u_{B,n-1} \end{bmatrix} \triangleq \mathrm{col}\{u'_k(n),\; k = 0, 1, \ldots, 2B-1\} \\
\lambda_k(n) &= \beta \lambda_k(n-1) + (1 - \beta)\,|u'_k(n)|^2, \quad k = 0, 1, \ldots, 2B-1 \\
u'_{k,n} &= \left[\; u'_k(n) \;\; \ldots \;\; u'_k\!\left(n - \tfrac{M}{B} + 1\right) \;\right], \quad k = 0, 1, \ldots, 2B-1 \\
y'_k(n) &= u'_{k,n}\, l_{k,n-1}, \quad k = 0, 1, \ldots, 2B-1 \\
\hat{d}_{B,n} &= [\,I_B \;\; 0_{B \times B}\,]\,F^*\, \mathrm{col}\{y'_0(n), \ldots, y'_{2B-1}(n)\} \\
e_{B,n} &= d_{B,n} - \hat{d}_{B,n} \\
e'_{2B,n} &= F \begin{bmatrix} I_B \\ 0_{B \times B} \end{bmatrix} e_{B,n} \triangleq \mathrm{col}\{e'_k(n),\; k = 0, 1, \ldots, 2B-1\} \\
l_{k,n} &= l_{k,n-1} + \frac{\mu}{\lambda_k(n)}\, u_{k,n}^{\prime *}\, e'_k(n), \quad k = 0, 1, \ldots, 2B-1
\end{aligned}$$

At each block $n$, the entries of $\{\hat{d}_{B,n}, e_{B,n}\}$ correspond to

$$\begin{aligned}
\hat{d}_{B,n} &= \mathrm{col}\{\hat{d}(nB+B-1), \ldots, \hat{d}(nB+1), \hat{d}(nB)\} \\
e_{B,n} &= \mathrm{col}\{e(nB+B-1), \ldots, e(nB+1), e(nB)\}
\end{aligned}$$


FIGURE 28.1 The unconstrained DFT-based block adaptive filter of Alg. 28.1. The input signal
$u(i)$ is processed by a bank of $2B$ decimators, while the reference signal $d(i)$ is processed by a bank
of $B$ decimators. The error sequence $\{e(i)\}$ is generated, with a delay of $(B - 1)$ samples, by using
a bank of $B$ interpolators.

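The recursions of Alg. 28.1 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `dft_block_nlms`, the default parameter values, the choice $u_{B,-1} = 0$, and the noise-free test channel are all assumptions made for the example. With the unscaled DFT used here, the step-size $\mu$ has to be chosen fairly small; the value below was picked only so that the illustrative run converges.

```python
import numpy as np

def dft_block_nlms(u, d, M, B, mu=0.01, beta=0.9, eps=1e-6):
    """Unconstrained DFT block adaptive filter in the spirit of Alg. 28.1 (sketch)."""
    K = M // B                                   # taps per subband filter
    N = 2 * B                                    # DFT size
    nblocks = len(u) // B
    l = np.zeros((N, K), dtype=complex)          # row k holds l_{k,n-1}
    lam = np.full(N, eps)                        # lambda_k(-1) = eps
    hist = np.zeros((N, K), dtype=complex)       # row k holds the regressor u'_{k,n}
    prev = np.zeros(B)                           # u_{B,-1} = 0 (assumption)
    e = np.zeros(nblocks * B, dtype=complex)

    for n in range(nblocks):
        cur = u[n * B:(n + 1) * B][::-1]                         # u_{B,n}
        u_prime = np.fft.fft(np.concatenate([cur, prev]))        # F col{u_{B,n}, u_{B,n-1}}
        lam = beta * lam + (1 - beta) * np.abs(u_prime) ** 2     # power estimates
        hist = np.roll(hist, 1, axis=1)
        hist[:, 0] = u_prime
        y_prime = np.sum(hist * l, axis=1)                       # y'_k(n) = u'_{k,n} l_{k,n-1}
        d_hat = (N * np.fft.ifft(y_prime))[:B]                   # [I_B 0] F^* y'
        e_B = d[n * B:(n + 1) * B][::-1] - d_hat                 # e_{B,n}
        e_prime = np.fft.fft(np.concatenate([e_B, np.zeros(B)])) # F [I_B; 0] e_{B,n}
        l = l + (mu / lam)[:, None] * np.conj(hist) * e_prime[:, None]
        e[n * B:(n + 1) * B] = e_B[::-1]                         # back to time order
        prev = cur
    return l, e

rng = np.random.default_rng(0)
g = rng.standard_normal(16)                      # illustrative channel with M = 16 taps
u = rng.standard_normal(12000)
d = np.convolve(u, g)[:len(u)]                   # noise-free reference, for simplicity
l_final, e = dft_block_nlms(u, d, M=16, B=4)
head_mse = np.mean(np.abs(e[:200]) ** 2)         # error power early in the run
tail_mse = np.mean(np.abs(e[-200:]) ** 2)        # error power at the end of the run
```

Comparing `head_mse` with `tail_mse` gives a rough indication of whether the filter has adapted to the channel in this illustrative setting.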


Alternatively, a standard $\epsilon$-NLMS recursion could be employed instead of (28.12), say,

$$l_{k,n} = l_{k,n-1} + \frac{\mu}{\epsilon + \|u'_{k,n}\|^2}\, u_{k,n}^{\prime *}\, e'_k(n), \qquad l_{k,-1} = 0 \tag{28.13}$$

with the step-size in (28.12) $M/B$ times smaller than the step-size in (28.13), as explained
following (11.10) in our discussions on $\epsilon$-NLMS with power normalization (recall that
$M/B$ is the length of each of the subband filters). The computer project at the end of the
chapter compares the performance of (28.12) and (28.13).

It is useful to note that the sequences $\{\hat{d}(i), e(i)\}$ that are generated by Alg. 28.1 do
not match exactly the sequences $\{\hat{d}(i), e(i)\}$ of the fullband adaptive implementation of
Fig. 27.2. There are at least two reasons for this difference:


a) First, even if we use NLMS with power normalization instead of LMS in (27.3), the
normalization that is performed by Alg. 28.1 is carried out in the frequency domain
and across the bank of subband filters, with a separate normalization for each filter.


b) A more significant difference is the fact that, in general, the successive filter estimates
$\{l_{k,n}\}$ that are computed by Alg. 28.1 do not necessarily satisfy the constraint
(27.20); i.e., in the time domain, the transformation

$$F \begin{bmatrix} l_{0,n}^{\mathsf{T}} \\ l_{1,n}^{\mathsf{T}} \\ \vdots \\ l_{2B-1,n}^{\mathsf{T}} \end{bmatrix} \tag{28.14}$$

need not result in a matrix whose last $B$ rows are zero, as required by (27.20)
(see also (27.24)). In this way, the resulting estimates for the polyphase components
$\{P_k(z)\}$ will not correspond to a circulant matrix function of the form (27.15). It is
for this reason that the implementation of Alg. 28.1 is referred to as unconstrained;
the qualification means that the successive weight estimates $\{l_{k,n}\}$ are not being
constrained so as to guarantee the structural requirement (27.20).


Constrained Filter Implementations 


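One way to enforce the constraint (27.20) into recursion (28.12) is as follows. After each
iteration $n$, we multiply (28.14) by $(I_B \oplus 0_{B \times B})$ from the left, i.e., we perform the operation

$$X \triangleq \begin{bmatrix} I_B & \\ & 0_{B \times B} \end{bmatrix} F \begin{bmatrix} l_{0,n}^{\mathsf{T}} \\ l_{1,n}^{\mathsf{T}} \\ \vdots \\ l_{2B-1,n}^{\mathsf{T}} \end{bmatrix}$$

which amounts to zeroing out the lower $B$ rows. Then the top $B$ rows of $X$ will be estimates
for the coefficients of the polyphase components $\{P_k(z)\}$. More explicitly,

$$\begin{bmatrix} p_{0,n}^{\mathsf{T}} \\ p_{1,n}^{\mathsf{T}} \\ \vdots \\ p_{B-1,n}^{\mathsf{T}} \\ 0_{B \times M/B} \end{bmatrix} = \begin{bmatrix} I_B & \\ & 0_{B \times B} \end{bmatrix} F \begin{bmatrix} l_{0,n}^{\mathsf{T}} \\ l_{1,n}^{\mathsf{T}} \\ \vdots \\ l_{2B-1,n}^{\mathsf{T}} \end{bmatrix} \tag{28.15}$$

where $p_{k,n}$ denotes a column vector with the estimates for the coefficients of the $k$-th
polyphase component $P_k(z)$. With the estimates $\{p_{k,n}\}$ so defined, we can return to the
frequency domain by multiplying the result by $F^*$, as required by (27.24), in order to
enforce the desired constraint on the subband filters. We shall denote the resulting subband
filters by $\{l^c_{k,n}\}$, with a superscript $c$ used to indicate that they are obtained from the $\{l_{k,n}\}$
by enforcing the constraint (27.20):

$$\begin{bmatrix} l_{0,n}^{c\mathsf{T}} \\ l_{1,n}^{c\mathsf{T}} \\ \vdots \\ l_{2B-1,n}^{c\mathsf{T}} \end{bmatrix} = \frac{1}{2B}\, F^* \begin{bmatrix} p_{0,n}^{\mathsf{T}} \\ p_{1,n}^{\mathsf{T}} \\ \vdots \\ p_{B-1,n}^{\mathsf{T}} \\ 0_{B \times M/B} \end{bmatrix} \tag{28.16}$$

Thus, note that the estimates $\{l^c_{k,n}\}$ are now such that the product

$$F \begin{bmatrix} l_{0,n}^{c\mathsf{T}} \\ l_{1,n}^{c\mathsf{T}} \\ \vdots \\ l_{2B-1,n}^{c\mathsf{T}} \end{bmatrix}$$

satisfies the constraint (27.20), with the required zeros on the left-hand side.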
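The constraint step (28.15)-(28.16) amounts to a transform, a zeroing of the lower $B$ rows, and a scaled inverse transform. A minimal NumPy sketch follows; the function name `enforce_constraint` and the random test weights are assumptions for illustration. It relies on `np.fft.fft(..., axis=0)` applying $F$ to each column and `np.fft.ifft(..., axis=0)` applying $F^*/2B$.

```python
import numpy as np

def enforce_constraint(l):
    """Constrain a 2B x (M/B) array of subband weights whose row k is l_{k,n}^T.

    Returns the constrained weights (row k is l^c_{k,n}^T) and the B x (M/B)
    array of polyphase coefficient estimates (row k is p_{k,n}^T).
    """
    N = l.shape[0]
    B = N // 2
    X = np.fft.fft(l, axis=0)        # F [l_0^T; ...; l_{2B-1}^T], cf. (28.14)
    X[B:, :] = 0                     # zero out the lower B rows, cf. (28.15)
    p = X[:B, :].copy()              # estimates of the polyphase coefficients
    l_c = np.fft.ifft(X, axis=0)     # (1/2B) F^* [p; 0], cf. (28.16)
    return l_c, p

rng = np.random.default_rng(0)
B, K = 4, 3                                              # illustrative sizes, M = B*K
l_unconstrained = rng.standard_normal((2 * B, K)) + 1j * rng.standard_normal((2 * B, K))
l_c, p = enforce_constraint(l_unconstrained)
lower_rows = np.fft.fft(l_c, axis=0)[B:, :]              # vanish after the constraint
```

Applying the function twice returns the same weights, which reflects the fact that the operation is a projection onto the set of weights satisfying (27.20).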


In summary, we arrive at the statement of Alg. 28.2 below. The main difference in
relation to the unconstrained implementation of Alg. 28.1 occurs in the evaluation of the
constrained weight vectors $\{l^c_{k,n}\}$ and their use in computing the filter outputs $\{y'_k(n)\}$.
All other recursions remain unchanged. Figure 28.2 shows a block diagram representation
of the algorithm; the errors $\{e'_k(n)\}$ are used to adapt the unconstrained vectors $\{l_{k,n}\}$,
which are subsequently corrected to $\{l^c_{k,n}\}$. In the figure, we only show the constrained
weight vectors $\{l^c_{k,n}\}$. The filters of Algs. 28.1 and 28.2 are usually referred to as
multi-delay filters (or MDF), with the qualification "multi-delay" used to mean that each of the
$2B$ subband filters $\{L_k(z)\}$ has multiple coefficients. Of course, a special case would be to
choose $B = M$ (i.e., to choose the block size equal to the fullband filter length), in which
case each subband filter will consist of a single coefficient.


Algorithm 28.2 (Constrained DFT block filter) Consider the adaptive channel
estimation scheme shown in Fig. 27.2, where $w_{i-1}$ is $M \times 1$ and the regressor
$u_i$ is $1 \times M$. For relatively long channels (i.e., large $M$), a more efficient
adaptive procedure for generating the estimates $\{\hat{d}(i)\}$, and the corresponding
error sequence $\{e(i)\}$, is the following. Choose a block size $B$; usually $M$
and $B$ are powers of 2 and $M/B$ is an integer. Select $0 \ll \beta < 1$, set
$l^c_{k,-1} = l_{k,-1} = 0$, $\lambda_k(-1) = 0$, and repeat for $n \geq 0$:

$$\begin{aligned}
u_{B,n} &= \mathrm{col}\{u(nB+B-1), \ldots, u(nB+1), u(nB)\} \\
u'_{2B,n} &= F \begin{bmatrix} u_{B,n} \\ u_{B,n-1} \end{bmatrix} \triangleq \mathrm{col}\{u'_k(n),\; k = 0, 1, \ldots, 2B-1\} \\
\lambda_k(n) &= \beta \lambda_k(n-1) + (1 - \beta)\,|u'_k(n)|^2, \quad k = 0, 1, \ldots, 2B-1 \\
u'_{k,n} &= \left[\; u'_k(n) \;\; \ldots \;\; u'_k\!\left(n - \tfrac{M}{B} + 1\right) \;\right], \quad k = 0, 1, \ldots, 2B-1 \\
y'_k(n) &= u'_{k,n}\, l^c_{k,n-1}, \quad k = 0, 1, \ldots, 2B-1 \\
\hat{d}_{B,n} &= [\,I_B \;\; 0_{B \times B}\,]\,F^*\, \mathrm{col}\{y'_0(n), \ldots, y'_{2B-1}(n)\} \\
e_{B,n} &= d_{B,n} - \hat{d}_{B,n} \\
e'_{2B,n} &= F \begin{bmatrix} I_B \\ 0_{B \times B} \end{bmatrix} e_{B,n} \triangleq \mathrm{col}\{e'_k(n),\; k = 0, 1, \ldots, 2B-1\} \\
l_{k,n} &= l_{k,n-1} + \frac{\mu}{\lambda_k(n)}\, u_{k,n}^{\prime *}\, e'_k(n), \quad k = 0, 1, \ldots, 2B-1 \\
\begin{bmatrix} l_{0,n}^{c\mathsf{T}} \\ \vdots \\ l_{2B-1,n}^{c\mathsf{T}} \end{bmatrix} &= \frac{1}{2B}\, F^* \begin{bmatrix} I_B & \\ & 0_{B \times B} \end{bmatrix} F \begin{bmatrix} l_{0,n}^{\mathsf{T}} \\ \vdots \\ l_{2B-1,n}^{\mathsf{T}} \end{bmatrix}
\end{aligned}$$

At each block $n$, the entries of $\{\hat{d}_{B,n}, e_{B,n}\}$ correspond to

$$\begin{aligned}
\hat{d}_{B,n} &= \mathrm{col}\{\hat{d}(nB+B-1), \ldots, \hat{d}(nB+1), \hat{d}(nB)\} \\
e_{B,n} &= \mathrm{col}\{e(nB+B-1), \ldots, e(nB+1), e(nB)\}
\end{aligned}$$


FIGURE 28.2 An implementation of the constrained DFT block adaptive filter of Alg. 28.2.
The input signal $u(i)$ is processed by a bank of $2B$ decimators, while the reference signal $d(i)$ is
processed by a bank of $B$ decimators. The error sequence $\{e(i)\}$ is generated, with a delay of
$(B - 1)$ samples, by using a bank of $B$ interpolators. Moreover, the errors $\{e'_k(n)\}$ are used to adapt
the unconstrained vectors $\{l_{k,n}\}$, which are subsequently corrected to $\{l^c_{k,n}\}$. In the figure, we only
show the constrained weight vectors $\{l^c_{k,n}\}$.


Computational Complexity 

The overall computational requirements of the block adaptive filters of Algs. 28.1-28.2,
and of their other constrained and overlap-add versions in Apps. 28.A and 28.B, are
comparable (i.e., the algorithms require essentially the same order of computations). For this
reason, it is enough to study the complexity of just one of the algorithms. We choose the
multi-delay filter of Alg. 28.2 or Fig. 28.2. The overall complexity of the algorithm can be
divided into four parts:


1. Subband decomposition of the input and error signals $\{u(i), e(i)\}$ in order to form
the transformed vectors $\{u'_{2B,n}, e'_{2B,n}\}$. This step requires two DFTs of size $2B$
each for each block of $B$ input samples. Assuming that the cost of a $K$-size DFT
is $\frac{K}{2}\log_2 K$ complex operations, we find that the total cost is

$$\frac{1}{B} \cdot 2 \cdot \big(B\log_2(2B)\big) = 2\log_2(2B) \quad \text{operations per input sample}$$

2. Updating of the subband filters. This step requires that we update $2B$ filters of
length $M/B$ each by using NLMS with power normalization. The updates are performed
once every $B$ input samples. Now since the update of a $K$-long NLMS filter
requires approximately $2K$ complex operations, we find that the total cost is

$$\frac{1}{B} \cdot 2B \cdot \frac{2M}{B} = \frac{4M}{B} \quad \text{operations per input sample}$$

3. Enforcement of the constraint (27.20). This requires $M/B$ transforms of size $2B$
for each block of $B$ input samples. The complexity of this part is similar to the
first one, except that here we need to compute $M/B$ transforms rather than only 2,
leading to

$$\frac{1}{B} \cdot \frac{M}{B} \cdot \big(B\log_2(2B)\big) = \frac{M}{B}\log_2(2B) \quad \text{operations per input sample}$$

The computational burden of this part can be reduced if we apply the constraint less
often than every $B$ samples, say, every $pB$ samples where $p$ is an integer larger than
one.

4. Inverse transformation. This step requires one DFT of size $2B$ to map the signals
$\{y'_k(n)\}$ into the time-domain signals $\{y_k(n)\}$. The cost would be

$$\frac{1}{B} \cdot \big(B\log_2(2B)\big) = \log_2(2B) \quad \text{operations per input sample}$$


In summary, we find that the computational complexity per input sample of the
multi-delay filter of Alg. 28.2, and similarly of the other block adaptive filters, is on the order
of

$$\frac{4M}{B} + \left(\frac{M}{B} + 3\right)\log_2(2B) \quad \text{operations per input sample}$$

The main conclusion is that the cost is $O(M/B)$ operations per input sample.


Remark 28.1 (Other block adaptive filters) The DFT block adaptive filters described so far, 
and also in Apps. 28.A and 28.B, are derived by embedding the matrix P(z) of (27.13) into the 
circulant matrix C(z) in (27.15). This latter matrix was then diagonalized by the DFT, as shown by 
(27.18). Now we could have embedded P(z) [or P(z) from (27.11)] into other matrices that are 
not necessarily circulant, but which can still be diagonalized by some other transforms, say, by the 
trigonometric transforms. In this way, we would be able to derive other block adaptive structures. 
This point of view is pursued in Apps. 10.D and 10.E of Sayed (2003), where it is shown how to 
derive block adaptive filters that are based on the discrete-cosine and discrete Hartley transforms 
(DCT and DHT). 




28.2 SUBBAND ADAPTIVE FILTERS 


Besides transform-domain and block adaptive filters, there is another class of adaptive al- 
gorithms that achieves computational savings and improvement in performance over a con- 
ventional LMS implementation. This third class of algorithms, known as subband adaptive 
filters, has a close relation to DFT-based block adaptive filters and we shall exploit this 
connection to motivate subband adaptive filters. 

Define the filters 


2B-1 
Hy(z) È YO ze, k= 0,1,...,2B-1 


n=0 
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In particular, Ho(z) is the moving-average (low-pass) filter 
Ho(z) =1 +27! +... +2703) 
with a rectangular window as its impulse response, while the remaining filters are related 


to Ho(z) via 
Hy(z) = Ho(ze ??7*/2) 


or, in terms of their frequency responses, 
H,(e”) = Ho (e(--#)) 2  k=0,1,...,2B—1 


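The filter $H_0(z)$ is called the prototype filter since the other filters $\{H_k(z)\}$ are generated
from it. Its frequency response is given by
\[
H_0(e^{j\omega}) = \begin{cases} 2B, & \omega = 0 \\[4pt] e^{-j\omega(2B-1)/2}\, \dfrac{\sin(\omega B)}{\sin(\omega/2)}, & \text{otherwise} \end{cases}
\]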


and is shown in Fig. 28.3 when B = 8. Its first zero crossing occurs at $\omega_0 = \pi/B$, which
is effectively half the width of its main lobe. We thus say that its approximate bandwidth
is $2\pi/B$ radians/sample. The attenuation of the first side lobe is approximately 13 dB
relative to the main lobe. The frequency responses of all other filters $\{H_k(z)\}$ are obtained
from $H_0(e^{j\omega})$ by shifting the latter along the frequency axis in multiples of $\pi/B$.
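The properties quoted above for the prototype filter can be checked numerically. The following is a minimal sketch, not part of the original text: the function names are illustrative, and the side-lobe peak is located by a simple grid search between the first two zeros of the response.

```python
import cmath
import math

def prototype_response(omega, B):
    """Evaluate H_0(e^{j*omega}) directly from its 2B unit-valued taps."""
    return sum(cmath.exp(-1j * omega * n) for n in range(2 * B))

def prototype_closed_form(omega, B):
    """Closed-form expression for the same frequency response."""
    if abs(math.sin(omega / 2)) < 1e-12:
        return complex(2 * B)
    phase = cmath.exp(-1j * omega * (2 * B - 1) / 2)
    return phase * math.sin(omega * B) / math.sin(omega / 2)

B = 8
# Magnitude at the first zero crossing, omega_0 = pi/B.
first_zero_mag = abs(prototype_response(math.pi / B, B))

# Peak of the first side lobe, searched between pi/B and 2*pi/B.
grid = [math.pi / B + k * (math.pi / B) / 2000 for k in range(1, 2000)]
side_peak = max(abs(prototype_response(w, B)) for w in grid)
sidelobe_db = 20 * math.log10(2 * B / side_peak)

print(f"|H0| at pi/B: {first_zero_mag:.2e}")
print(f"first side-lobe attenuation: {sidelobe_db:.1f} dB")
```

The attenuation reported by this search should come out close to the 13 dB figure mentioned in the text, which is the well-known limitation of a rectangular window.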


FIGURE 28.3 Magnitude frequency response of the prototype filter $H_0(z)$ for B = 8. The width
of the main lobe is $2\pi/B = \pi/4$ rad/sample, and the first side lobe is about 13 dB below the main lobe.

FIGURE 28.4 Two equivalent representations in terms of the DFT matrix and a set of DFT-
modulated bandpass filters $\{H_k(z)\}$.


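Now using the expression for the DFT matrix from (27.17), it is easy to see from the
definition of the $\{H_k(z)\}$ that they are obtained as follows:
\[
\begin{bmatrix} H_0(z) \\ H_1(z) \\ \vdots \\ H_{2B-1}(z) \end{bmatrix} = F \begin{bmatrix} 1 \\ z^{-1} \\ \vdots \\ z^{-(2B-1)} \end{bmatrix}
\]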


This result shows that the DFT transformations from $\{u(i)\}$ to the $\{u'_k(n)\}$ on the left-
hand side of Fig. 28.4 can be redrawn equivalently as shown on the right-hand side of the
same figure in terms of the DFT-modulated bandpass filters $\{H_k(z)\}$. In other words, the
step on the left involving decimation followed by DFT can be interpreted as attempting to
split the bandwidth of the original input signal $\{u(i)\}$ into a set of 2B partially exclusive
and equally wide bands by filtering through the $\{H_k(z)\}$. It is for this reason that the
bank of filters $\{H_k(z)\}$ is called the uniform DFT analysis filter bank. If we examine the
unconstrained and constrained DFT-based adaptive structures of Figs. 28.1 and 28.2, we
see that they both employ the DFT analysis filter bank at their inputs. Of course, since the
bandwidths of the filters $\{H_k(z)\}$ overlap, and since the side lobe attenuation in each filter
is not high enough relative to the main lobe, the bands of the resulting signals $\{u'_k(n)\}$ are
not necessarily mutually exclusive or well separated.
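The equivalence between "DFT of a sliding block" and "filtering through a bank of modulated filters" can be illustrated with a short numerical check. This is only a sketch under a stated assumption: the DFT matrix is taken to have entries $e^{-j2\pi kn/2B}$, which may differ from the normalization in (27.17); the helper names are illustrative.

```python
import cmath
import math
import random

def dft(x):
    """Plain O(N^2) DFT with kernel exp(-j*2*pi*k*m/N) (assumed convention)."""
    N = len(x)
    return [sum(x[m] * cmath.exp(-2j * math.pi * k * m / N) for m in range(N))
            for k in range(N)]

def filter_output(h, u, i):
    """Output at time i of the FIR filter with taps h driven by u."""
    return sum(h[n] * u[i - n] for n in range(len(h)))

B = 4
N = 2 * B
random.seed(0)
u = [random.uniform(-1.0, 1.0) for _ in range(40)]

i = 25  # any time index with i >= 2B - 1
# Block of the 2B most recent samples, newest first.
block = [u[i - m] for m in range(N)]
via_dft = dft(block)

# k-th modulated filter: taps exp(-j*2*pi*k*n/2B), n = 0, ..., 2B-1.
via_filters = [
    filter_output([cmath.exp(-2j * math.pi * k * n / N) for n in range(N)], u, i)
    for k in range(N)
]

max_diff = max(abs(a - b) for a, b in zip(via_dft, via_filters))
print(f"largest mismatch between the two routes: {max_diff:.2e}")
```

Evaluating the block DFT only at every B-th time index corresponds to the decimators that follow the filters on the right-hand side of Fig. 28.4.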


Analysis Filter Bank 
More generally, we may consider employing different choices for the prototype filter 
Ho(z), and also even change the number of analysis filters used as well as the value of 
the decimation factor B. These choices would be guided by the desire to arrive at filter 
structures that result in better band partitioning. 

So assume we select some prototype filter H(z) and use it to define an analysis filter 
bank with K filters as follows: 


\[
H_k(z) = H\left(z e^{j2\pi k/K}\right), \qquad k = 0, 1, \ldots, K-1 \qquad (28.17)
\]
i.e., with frequency responses
\[
H_k(e^{j\omega}) = H\left(e^{j(\omega + \omega_k)}\right), \qquad \omega_k = 2\pi k/K, \qquad k = 0, 1, \ldots, K-1
\]

FIGURE 28.5 A uniform DFT-based analysis filter bank.


In this case, the interval $[0, 2\pi]$ is divided into K equally spaced subbands of width $2\pi/K$
each. Now since the signals at the outputs of the $\{H_k(z)\}$ have bandwidths that are smaller
than the bandwidth of the fullband signal $\{u(i)\}$, the outputs of $\{H_k(z)\}$ can be decimated
(i.e., down-sampled to a lower rate), say, by some factor D, prior to further processing. The
resulting structure is shown in Fig. 28.5. It is still called a uniform DFT-based analysis filter
bank; the difference now, compared with Fig. 28.4, is that the number of analysis filters,
K, and the decimation factor, D, are not necessarily related as before. Moreover, the choice
of H(z) is also left to the designer. The analysis filter bank $\{H_0(z), H_1(z), \ldots, H_{K-1}(z)\}$
is usually chosen such that the bandwidth of the input signal u(i) is divided uniformly into
K bands of equal bandwidths. In this way, the bandwidth of the output of each of the analysis
filters is K times smaller than the bandwidth of the original signal u(i) and, therefore,
we can decimate the outputs of the analysis filters by some factor $D \leq K$.

The special choice D = K results in what is called a maximally or critically decimated
analysis filter bank. This is because, in this case, the total number of samples after decimation
is the same as the total number of input samples. Thus, observe that N input samples
$\{u(i)\}$ lead to N samples at the output of each analysis filter and, therefore, to a total of
KN samples at the output of these filters. After decimation by a factor D = K, we are left
again with N samples. Choices of D such that D < K, on the other hand, lead to what are
called oversampled or noncritically sampled analysis filter banks. In these cases, N input
samples generate more than N total samples at the output of the decimators. If D > K,
then aliasing occurs in the subbands.
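The sample-count argument can be made concrete with a few lines of code. This is a small illustrative sketch; the function name and the chosen values of N, K, and D are assumptions made only for the example.

```python
def samples_after_decimation(N, K, D):
    """Total number of subband samples produced by N input samples.

    Each of the K analysis filters outputs N samples, and each
    decimator keeps one sample out of every D.
    """
    per_band = len(range(0, N, D))
    return K * per_band

N, K = 1024, 8
critical = samples_after_decimation(N, K, D=8)     # D = K
oversampled = samples_after_decimation(N, K, D=4)  # D < K

print("critically decimated:", critical)
print("oversampled by two  :", oversampled)
```

With D = K the subband representation carries as many samples as the input, whereas D = K/2 doubles the sample count in exchange for more relaxed requirements on the analysis filters.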


Synthesis Filter Bank 

The step of recovering the input signal $\{u(i)\}$ from its subband components $\{u'_k(n)\}$ in
Fig. 28.5 is known as synthesis. To do so, the signals $\{u'_k(n)\}$ are first upsampled by
inserting D - 1 zeros after each sample, i.e., the $\{u'_k(n)\}$ are transformed into $\{s_k(i)\}$
as follows:
\[
u'_k(n) \;\longrightarrow\; \boxed{\uparrow D} \;\longrightarrow\; s_k(i)
\]
where $s_k(i)$ is related to $u'_k(n)$ via
\[
s_k(i) = \begin{cases} u'_k(i/D), & \text{if } i/D \text{ is an integer} \\ 0, & \text{otherwise} \end{cases}
\]

FIGURE 28.6 A cascade of analysis and synthesis filter banks.

Now since, for any interpolator, the spectrum of its output signal consists of D repetitions
of the spectrum of its input signal (see Prob. VI.17), we need to filter $\{s_k(i)\}$ by some
lowpass filter in order to remove the repetitions in the spectrum.
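The spectral repetitions caused by zero insertion can be seen directly on a finite sequence. The sketch below is illustrative only: it uses a plain DFT (with the conventional negative exponent) as a stand-in for the spectrum, and the test sequence is arbitrary.

```python
import cmath
import math

def dft(x):
    """Plain O(N^2) DFT with kernel exp(-j*2*pi*k*m/N)."""
    N = len(x)
    return [sum(x[m] * cmath.exp(-2j * math.pi * k * m / N) for m in range(N))
            for k in range(N)]

def upsample(x, D):
    """Insert D - 1 zeros after each sample of x."""
    out = []
    for value in x:
        out.append(value)
        out.extend([0.0] * (D - 1))
    return out

x = [1.0, -2.0, 0.5, 3.0, 0.0, 1.5, -1.0, 2.0]
D = 3

X = dft(x)               # 8-point spectrum of the input
S = dft(upsample(x, D))  # 24-point spectrum of the upsampled signal

# The 24-point spectrum is the 8-point spectrum repeated D times.
max_diff = max(abs(S[k] - X[k % len(x)]) for k in range(len(S)))
print(f"largest deviation from D exact repetitions: {max_diff:.2e}")
```

The lowpass (or, for the k-th band, modulated) synthesis filter keeps only one of these D copies.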
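Thus, select some lowpass prototype filter R(z) and use it to define a synthesis filter
bank with K filters as follows:
\[
R_k(z) = R\left(z e^{j2\pi k/K}\right), \qquad k = 0, 1, \ldots, K-1
\]
i.e., with frequency responses
\[
R_k(e^{j\omega}) = R\left(e^{j(\omega + \omega_k)}\right), \qquad \omega_k = 2\pi k/K, \qquad k = 0, 1, \ldots, K-1
\]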


Considerable investigations in the literature have gone into the design of the prototype filters
$\{H(z), R(z)\}$, and consequently into the design of analysis and synthesis filter banks,
in order to reduce the distortion and delay between the input and output nodes (such as 
the design of perfect reconstruction filter banks). Such studies are well-documented in 
textbooks on multi-rate discrete-time signal processing (see, e.g., Vaidyanathan (1993)). 
When the subband signals are modified (due to overlaps in their spectra), cross-filters may 
be needed in order to reduce signal distortion (see, e.g., Gilloire and Vetterli (1992)). 


Structures for Subband Filtering 
Figures 28.7 and 28.8 show two basic configurations for adaptive subband filtering, commonly
referred to as open-loop and closed-loop configurations.

In Fig. 28.7, the input sequence $\{u(i)\}$ and the reference sequence $\{d(i)\}$ are processed
by the same analysis filter bank to produce subband signals $\{u'_k(n), d'_k(n)\}$. The subband
signals are then used to train adaptive filters $\{l_{k,n-1}\}$ by using the subband error sequences
\[
e'_k(n) = d'_k(n) - y'_k(n)
\]
where the $\{y'_k(n)\}$ are the outputs of the subband adaptive filters. These outputs are in turn
applied to a synthesis filter bank to generate the signal $\hat{d}(i - \Delta)$, for some delay $\Delta$. The
delay $\Delta$ is present because filter banks introduce delay in the signal path. One
major drawback of this open-loop structure for adaptive filtering is that it usually results
in a higher mean-square error since the algorithm is attempting to minimize
the variance of the subband errors $\{e'_k(n)\}$, and not the variance of the fullband errors
$\{e(i) = d(i) - \hat{d}(i)\}$.

In Fig. 28.8, on the other hand, the error signal e(i) is evaluated in fullband and then 
converted to subband by means of the analysis filter bank. In other words, the reference 
sequence d(i) is now kept in fullband. Although this scheme tends to show better mean- 
square performance than the open-loop scheme due to the fullband error feedback mecha- 
nism, this improvement comes at the expense of a delay in the error feedback path, which 
degrades the convergence performance. In particular, observe that there is the delay intro- 
duced by the two filter banks between the outputs of the subband adaptive filters and the 
subband errors. 

Another problem with both the open- and closed-loop subband structures is the lack 
of optimality in their construction. As we have seen throughout the discussions so far 
in the book, adaptive structures are usually derived as approximations to some optimality 
criterion, such as the mean-square error criterion. However, this is not generally the case 
with the structures of Figs. 28.7 and 28.8. This is because they are motivated mostly by the 
desire to partition the input signal bandwidth into smaller “separate” bands over which the
spectra are close to flat. There is no guarantee that these implementations will ultimately
provide the error and estimate sequences $\{e(i), \hat{d}(i)\}$ that would result from an optimal
or close-to-optimal mean-square-error formulation; contrast this situation, for example,
with the DFT-based schemes of Algs. 28.2 and 28.3, which followed from the solution to
(28.5). In addition, no constraints on the subband weight vectors are usually applied during
adaptation of the open- and closed-loop structures.

FIGURE 28.7 Open-loop structure for subband adaptive filtering.

FIGURE 28.8 Closed-loop structure for subband adaptive filtering.


28.A APPENDIX: ANOTHER CONSTRAINED FILTER 


There is an alternative to the constrained filter of Alg. 28.2. Observe that the error vector $e_{B,n}$ in the
algorithm is computed using the frequency-domain data $\{y'_k(n)\}$. This error vector can be evaluated
in another manner by convolving the input sequence $\{u(i)\}$ directly with an estimate of the fullband
filter G(z) itself. More specifically, from (28.15) we have that
\[
\begin{bmatrix} p_{0,n-1}^T \\ p_{1,n-1}^T \\ \vdots \\ p_{B-1,n-1}^T \end{bmatrix} = \begin{bmatrix} I_B & 0_{B\times B} \end{bmatrix} F \begin{bmatrix} l_{0,n-1}^T \\ l_{1,n-1}^T \\ \vdots \\ l_{2B-1,n-1}^T \end{bmatrix} \qquad (28.18)
\]


This expression provides estimates for the polyphase components of G(z) at block iteration n - 1.
Using the relation between G(z) and its polyphase components $\{P_k(z)\}$ from (27.6), the above
$\{p_{k,n-1}\}$ can be used to estimate the impulse response sequence g itself. If we denote the estimate
of g at block iteration n - 1 by $w_{n-1}$, then
\[
w_{n-1} = \mathrm{vec}\left\{ \begin{bmatrix} p_{0,n-1}^T \\ p_{1,n-1}^T \\ p_{2,n-1}^T \\ \vdots \\ p_{B-1,n-1}^T \end{bmatrix} \right\} \qquad (28.19)
\]


In other words, we simply stack the columns of (28.18) on top of each other. Then, at block iteration
n, we can generate the error signals as follows:
\[
\begin{aligned}
\hat{d}(nB+j) &= u_{nB+j}\, w_{n-1} \qquad \text{(convolution in fullband)} \\
e(nB+j) &= d(nB+j) - \hat{d}(nB+j), \qquad j = 0, 1, \ldots, B-1
\end{aligned} \qquad (28.20)
\]
where
\[
u_{nB+j} = \begin{bmatrix} u(nB+j) & u(nB+j-1) & \ldots & u(nB+j-M+1) \end{bmatrix}
\]
is the regressor of the fullband filter at time nB + j. The sequence $\{e(i)\}$ so generated is then
decimated by a factor of B to construct $e_{B,n}$. The resulting filter structure is shown in Fig. 28.9.
Compared with Fig. 28.2, we see that only two DFT transformations are used, in addition to a
fullband convolution. The constraint (27.20) is enforced in the process of mapping the subband filters
$\{l_{k,n-1}\}$ into the fullband filter $w_{n-1}$ according to (28.19). We therefore find that the constrained
filter of Alg. 28.2 can also be implemented as Alg. 28.3.
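The two fullband operations just described, stacking polyphase estimates into a weight vector and forming the block of errors by direct convolution, can be sketched as follows. The sketch assumes the usual convention that the k-th polyphase component collects the taps g(mB + k); (27.6) is not reproduced here, so this indexing is an assumption, as are all function names. Errors are returned in the order j = 0, ..., B-1, whereas the block vector $e_{B,n}$ in the algorithm lists the most recent entry first.

```python
def polyphase_components(g, B):
    """Rows p_k = [g(k), g(B + k), g(2B + k), ...] for k = 0, ..., B-1 (assumed convention)."""
    return [g[k::B] for k in range(B)]

def vec(rows):
    """Stack the columns of the matrix whose rows are given, one on top of the other."""
    n_cols = len(rows[0])
    return [row[c] for c in range(n_cols) for row in rows]

def regressor(u, i, M):
    """Row regressor [u(i), u(i-1), ..., u(i-M+1)], with zeros before time 0."""
    return [u[i - m] if i - m >= 0 else 0.0 for m in range(M)]

def block_errors(u, d, w, n, B):
    """Errors e(nB + j) = d(nB + j) - u_{nB+j} w for j = 0, ..., B-1."""
    M = len(w)
    errors = []
    for j in range(B):
        i = n * B + j
        d_hat = sum(a * b for a, b in zip(regressor(u, i, M), w))
        errors.append(d[i] - d_hat)
    return errors

g = [0.5, -1.0, 2.0, 0.25, 0.0, 1.5, -0.75, 3.0]  # M = 8 taps
B = 4
w = vec(polyphase_components(g, B))                # recovers g itself

u = [float((3 * i) % 7 - 3) for i in range(16)]
d = [sum(a * b for a, b in zip(regressor(u, i, len(g)), g)) for i in range(len(u))]
errs = block_errors(u, d, w, n=2, B=B)
print(w == g, errs)
```

When the weight vector matches the true channel, as in this noiseless example, the block of errors is identically zero.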

We still need to discuss how to evaluate the inner product $\hat{d}(nB+j) = u_{nB+j} w_{n-1}$ in an
efficient manner. Otherwise, since the vectors involved are M-dimensional, the resulting filter
implementation ends up being inefficient and we defeat our original objective of reduced computational
cost. Now we already know from Sec. 27.2 how to evaluate inner products (or convolutions)
efficiently by using Alg. 27.1. However, two issues arise here that were not relevant before:

a) First, if we employ the structure of Fig. 27.8 to evaluate the inner product $\hat{d}(nB+j) =
u_{nB+j} w_{n-1}$, with $w_{n-1}$ playing the role of g, then we would end up with two frequency-
domain structures: one is used for efficient convolution while the other is used for block filter
adaptation, as in Fig. 28.10. There is no need for the block sizes in both structures to
be identical. For instance, the block size for adaptation could be B and the one for convolution
could be some other value R, with the constraint that B is a multiple of R, e.g., B = mR for
some integer m. The structure of Fig. 28.10 then functions as follows:

1. Starting from the subband weights $\{l_{k,n-1}\}$, an estimate for the fullband filter is computed,
i.e., $w_{n-1}$. The polyphase components of order M/R of $w_{n-1}$ are determined
and used via (27.24) to find the corresponding bank of 2R subband filters for the
convolution structure, which we denote by $\{L_{k,n-1}(z)\}$.

2. At block iteration n, by the time the adaptive structure finishes forming the n-th block
of data, the convolution structure would have operated on m sub-blocks of size R each
and generated the outputs $\{\hat{d}(nB+j),\ j = 0, \ldots, B-1\}$ (which are the entries of the
block vector $\hat{d}_{B,n}$). These signals are used by the adaptive structure to compute $e_{B,n}$
and to update the subband weights to $\{l_{k,n}\}$.

b) Second, the subband filters $\{L_{k,n-1}(z)\}$ of the convolution structure are updated after every
block iteration, i.e., after every new estimate $w_n$ is generated. The states (i.e., initial conditions)
of the subband filters within the convolution structure should of course propagate from
one iteration to another.



One advantage of the constrained filter of Figs. 28.9 and 28.10 over the constrained filter of Fig. 28.2 
is that the delay problem associated with the convolution structure of Fig. 28.10 can be addressed by 
using the modifications suggested in Prob. VI.14. 


Algorithm 28.3 (Another constrained DFT block filter) Consider the adaptive
channel estimation scheme shown in Fig. 27.2, where $w_{i-1}$ is M x 1 and the
regressor $u_i$ is 1 x M. For relatively long channels (i.e., large M), a more efficient
adaptive procedure for generating the estimates $\hat{d}(i)$, and the corresponding error
sequence $\{e(i)\}$, is the following. Choose a block size B; usually M and B are
powers of 2 and M/B is an integer. Select $0 \ll \beta < 1$, set $l_{k,-1} = 0$,
$w_{-1} = 0$, $u_{-1} = 0$, $u_{B,-1} = 0$, $\lambda_k(-1) = \epsilon$ (a small positive number), and repeat
for $n \geq 0$:

Filtering in fullband. Generate the entries of $e_{B,n}$ by computing for j = 0, 1, ..., B-1:
\[
\begin{aligned}
u_{nB+j} &= \begin{bmatrix} u(nB+j) & u(nB+j-1) & \ldots & u(nB+j-M+1) \end{bmatrix} \\
\hat{d}(nB+j) &= u_{nB+j}\, w_{n-1} \\
e(nB+j) &= d(nB+j) - \hat{d}(nB+j)
\end{aligned}
\]

Block input and error signals. Construct the block vectors:
\[
\begin{aligned}
u_{B,n} &= \mathrm{col}\{u(nB+B-1), \ldots, u(nB)\} \\
u_{2B,n} &= \mathrm{col}\{u_{B,n},\, u_{B,n-1}\} \\
e_{B,n} &= \mathrm{col}\{e(nB+B-1), \ldots, e(nB)\}
\end{aligned}
\]

Frequency transformations. Perform the DFT transformations:
\[
u'_{2B,n} = F u_{2B,n}, \qquad e'_{2B,n} = F \begin{bmatrix} I_B \\ 0_{B\times B} \end{bmatrix} e_{B,n}
\]
and let $\{e'_k(n), u'_k(n)\}$ denote the k-th entries of $\{e'_{2B,n}, u'_{2B,n}\}$. Let also
\[
u'_{k,n} = \begin{bmatrix} u'_k(n) & u'_k(n-1) & \ldots & u'_k(n - M/B + 1) \end{bmatrix}
\]

Adaptation. For each subband filter k = 0, ..., 2B-1, adapt its coefficients:
\[
\begin{aligned}
\lambda_k(n) &= \beta \lambda_k(n-1) + (1-\beta)\, |u'_k(n)|^2 \\
l_{k,n} &= l_{k,n-1} + \frac{\mu}{\lambda_k(n)}\, u'^{*}_{k,n}\, e'_k(n)
\end{aligned}
\]


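Subband/fullband mapping. Set
\[
w_n = \mathrm{vec}\left\{ \begin{bmatrix} I_B & 0_{B\times B} \end{bmatrix} F \begin{bmatrix} l_{0,n}^T \\ \vdots \\ l_{2B-1,n}^T \end{bmatrix} \right\}
\]
where the operation vec(·) stacks the columns of its argument on top of each other.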
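The adaptation step used by the block filters in this chapter is a power-normalized LMS update applied separately in each subband. The following is a minimal sketch for a single subband, assuming the recursion reads $\lambda_k(n) = \beta\lambda_k(n-1) + (1-\beta)|u'_k(n)|^2$ followed by $l_{k,n} = l_{k,n-1} + (\mu/\lambda_k(n))\,u'^{*}_{k,n}\,e'_k(n)$; the function name and the numerical values are illustrative.

```python
def subband_update(l_prev, lam_prev, u_reg, e_k, mu, beta):
    """One power-normalized LMS step for a single subband filter.

    l_prev   : current taps of the subband filter (list of complex numbers)
    lam_prev : previous power estimate lambda_k(n-1)
    u_reg    : subband regressor [u'_k(n), u'_k(n-1), ...]
    e_k      : transformed error e'_k(n)
    """
    # Running estimate of the power in this frequency bin.
    lam = beta * lam_prev + (1.0 - beta) * abs(u_reg[0]) ** 2
    step = mu / lam
    # Gradient step using the conjugated regressor entries.
    l_new = [l + step * u.conjugate() * e_k for l, u in zip(l_prev, u_reg)]
    return l_new, lam

l_new, lam = subband_update(
    l_prev=[0j, 0j], lam_prev=1.0,
    u_reg=[2 + 0j, 1j], e_k=1 + 1j,
    mu=0.5, beta=0.5,
)
print(lam, l_new)
```

Dividing the step-size by the running power estimate equalizes the convergence rates across frequency bins, which is the same idea used by the transform-domain filters discussed earlier.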


28.B APPENDIX: OVERLAP-ADD BLOCK ADAPTIVE FILTERS 


The filters of Algs. 28.1, 28.2, and 28.3 are referred to as overlap-save filters. As explained in
Remark 1 in the text, the term overlap-save is used to indicate that successive input blocks $\{u_{2B,n}\}$
are overlapped prior to the DFT operation and only half of each block is saved. In this appendix,



FIGURE 28.9 An implementation of the constrained DFT block adaptive filter of Alg. 28.3.
The input sequence $\{u(i)\}$ is processed by a bank of 2B decimators, and is also convolved
in fullband with $w_{n-1}$ in order to generate $\hat{d}(i)$. The error sequence $\{e(i)\}$ is processed by
a bank of B decimators and used to generate the transformed errors $\{e'_k(n)\}$ for subband
adaptation. The mapping from the $\{l_{k,n-1}\}$ to $w_{n-1}$ is done according to (28.19).


we derive alternative overlap-add structures for DFT-based block adaptive filtering. In this case,
the successive input blocks $\{u_{2B,n}\}$ do not share a common sub-block. These structures can be
motivated as follows. Rather than factor $\mathcal{G}(z)$ as in (27.10) with $\{P(z), Q(z)\}$ given by (27.13)-
(27.14), we can equivalently factor it as
\[
\mathcal{G}(z) = Q_1(z) P_1(z)
\]
where now, e.g., for B = 3 again,
\[
P_1(z) = \begin{bmatrix} 0 & 0 & 0 \\ P_2(z) & 0 & 0 \\ P_1(z) & P_2(z) & 0 \\ P_0(z) & P_1(z) & P_2(z) \\ 0 & P_0(z) & P_1(z) \\ 0 & 0 & P_0(z) \end{bmatrix}, \qquad
Q_1(z) = \begin{bmatrix} z^{-1} & 0 & 0 & 1 & 0 & 0 \\ 0 & z^{-1} & 0 & 0 & 1 & 0 \\ 0 & 0 & z^{-1} & 0 & 0 & 1 \end{bmatrix}
\]
Following the derivation in Sec. 27.2, we can embed $P_1(z)$ into the same circulant matrix C(z) in
(27.15), and $P_1(z)$ can be recovered from the last columns of C(z) as
\[
P_1(z) = C(z) \begin{bmatrix} 0_{B\times B} \\ I_B \end{bmatrix}
\]


FIGURE 28.10 A second implementation of the constrained DFT block adaptive filter of
Alg. 28.3 with frequency-domain structures for both adaptation and convolution. The input
signal u(i) is processed by two banks of 2R and 2B decimators for convolution and adaptation,
respectively. The $\{L_{k,n-1}(z)\}$ are the subband filters that correspond to the fullband weight vector
$w_{n-1}$, which in turn is obtained by mapping the subband filters $\{l_{k,n-1}\}$ according to the
constraint (28.19).


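Then (27.21) becomes
\[
\mathcal{G}(z) = Q_1(z)\, \underbrace{C(z) \begin{bmatrix} 0_{B\times B} \\ I_B \end{bmatrix}}_{P_1(z)} = Q_1(z)\, \underbrace{F^* L(z) F}_{C(z)} \begin{bmatrix} 0_{B\times B} \\ I_B \end{bmatrix} \qquad (28.22)
\]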


This result shows that the mapping from $u_{B,n}$ to $y_{B,n}$ in Fig. 27.4 can be alternatively implemented
as shown in Fig. 28.11; compare with Fig. 27.5. Observe that the outputs of the transformation by
$F^*$ are combined as shown in the figure to yield $y_{B,n}$. The delays at the output of Fig. 28.11 can be
moved prior to the subband filters $\{L_k(z)\}$. To see this, we first express $Q_1(z)$ as
\[
Q_1(z) = \begin{bmatrix} 0_{B\times B} & I_B \end{bmatrix} + z^{-1} \begin{bmatrix} I_B & 0_{B\times B} \end{bmatrix}
\]

FIGURE 28.11 An alternative implementation of the mapping in Fig. 27.4 for block
estimation; compare with Fig. 27.5.


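so that expression (28.22) for $\mathcal{G}(z)$ becomes
\[
\mathcal{G}(z) = \begin{bmatrix} 0_{B\times B} & I_B \end{bmatrix} F^* L(z) F \begin{bmatrix} 0_{B\times B} \\ I_B \end{bmatrix} + z^{-1} \begin{bmatrix} I_B & 0_{B\times B} \end{bmatrix} F^* L(z) F \begin{bmatrix} 0_{B\times B} \\ I_B \end{bmatrix} \qquad (28.23)
\]
Now it can be verified that (see Prob. VI.11):
\[
F \begin{bmatrix} I_B \\ 0_{B\times B} \end{bmatrix} = J F \begin{bmatrix} 0_{B\times B} \\ I_B \end{bmatrix} \qquad (28.24)
\]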


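where J = diag(1, -1, 1, -1, ..., 1, -1) is the 2B x 2B diagonal matrix with alternating ±1's.
Substituting into (28.23) we find that $\mathcal{G}(z)$ can be rewritten more compactly as
\[
\mathcal{G}(z) = \begin{bmatrix} 0_{B\times B} & I_B \end{bmatrix} F^* L(z) \left( I + z^{-1} J \right) F \begin{bmatrix} 0_{B\times B} \\ I_B \end{bmatrix} \qquad (28.25)
\]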


This result shows that the implementation of Fig. 28.11 can be redrawn as shown in Fig. 28.12, where
the signals $\{s'_k(n)\}$ are generated via
\[
s'_k(n) = u'_k(n) + (-1)^k\, u'_k(n-1), \qquad k = 0, 1, \ldots, 2B-1 \qquad (28.26)
\]


In summary, we find that the efficient block convolution of Alg. 27.1 can also be attained as 
Alg. 28.4. 
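Both the identity (28.24) and the combination rule (28.26) are easy to confirm numerically. The sketch below assumes a DFT matrix with entries $e^{-j2\pi km/2B}$ (the identity holds equally for the conjugate convention) and uses illustrative names and data. It also shows a consequence that follows directly from (28.24): the vector of signals $s'_k(n)$ equals the DFT of the two most recent input blocks placed one after the other.

```python
import cmath
import math

def dft(x):
    """Plain O(N^2) DFT with kernel exp(-j*2*pi*k*m/N) (assumed convention)."""
    N = len(x)
    return [sum(x[m] * cmath.exp(-2j * math.pi * k * m / N) for m in range(N))
            for k in range(N)]

def unit(N, m):
    """m-th basis vector of length N."""
    return [1.0 if i == m else 0.0 for i in range(N)]

B = 4
N = 2 * B
J = [(-1) ** k for k in range(N)]  # diagonal of J

# (28.24): column m of F[I_B; 0] equals J times column m of F[0; I_B].
identity_err = 0.0
for m in range(B):
    left = dft(unit(N, m))
    right = [J[k] * v for k, v in enumerate(dft(unit(N, m + B)))]
    identity_err = max(identity_err, max(abs(a - b) for a, b in zip(left, right)))

# (28.26): s'_{2B,n} = u'_{2B,n} + J u'_{2B,n-1}, with u'_{2B,n} = F col{0, u_{B,n}}.
u_prev = [1.0, 2.0, -1.0, 0.5]
u_curr = [0.0, -3.0, 4.0, 1.0]
zeros = [0.0] * B
u_t_curr = dft(zeros + u_curr)
u_t_prev = dft(zeros + u_prev)
s = [a + J[k] * b for k, (a, b) in enumerate(zip(u_t_curr, u_t_prev))]

# Same result from a single DFT of the concatenated blocks.
direct = dft(u_prev + u_curr)
combine_err = max(abs(a - b) for a, b in zip(s, direct))
print(f"(28.24) mismatch: {identity_err:.2e}, (28.26) mismatch: {combine_err:.2e}")
```

In other words, multiplying the previous transformed block by J plays the role of shifting that block into the upper half of the DFT input, so only one new DFT is needed per iteration.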


FIGURE 28.12 The delays at the output of Fig. 28.11 are moved prior to the subband filters.


Algorithm 28.4 (Block overlap-add convolution via DFT) Consider the mapping
shown in Fig. 27.3 for some transfer function G(z) with M taps. The direct convolution
operation can be implemented more efficiently as shown in Fig. 28.12.
Specifically, choose a block size B; usually M and B are powers of 2 and M/B is
an integer.

Bank of filters. Determine the polyphase components of G(z) of order M/B, as
defined by (27.6), and use (27.23)-(27.24) to determine the 2B filters $\{L_k(z)\}$ of
size M/B each.

Block filtering. Start with $u'_{2B,-1} = 0$ and repeat for $n \geq 0$:

1. Construct the block vectors:
\[
u_{B,n} = \mathrm{col}\{u(nB+B-1), \ldots, u(nB)\}, \qquad u_{2B,n} = \mathrm{col}\{0_{B\times 1},\, u_{B,n}\}
\]

2. Perform the DFT transformation $u'_{2B,n} = F u_{2B,n}$ and let
\[
s'_{2B,n} = u'_{2B,n} + J u'_{2B,n-1} \;\triangleq\; \mathrm{col}\{s'_k(n),\ k = 0, 1, \ldots, 2B-1\}
\]
Filter the entries $\{s'_k(n)\}$ through the subband filters $\{L_k(z)\}$. Let $y'_{2B,n} =
\mathrm{col}\{y'_k(n),\ k = 0, 1, \ldots, 2B-1\}$ denote the outputs of these filters at iteration n.

3. Perform the DFT transformation:
\[
\begin{bmatrix} \times \\ y_{B,n} \end{bmatrix} = F^* y'_{2B,n}
\]
where $\times$ denotes B entries to be ignored, and $y_{B,n} = \mathrm{col}\{y(nB+B-1), \ldots, y(nB+1), y(nB)\}$.
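For orientation, the classical single-block overlap-add convolution is sketched below. It is deliberately simpler than Alg. 28.4: it is not the multi-tap subband structure of Fig. 28.12, it assumes a real filter with at most B + 1 taps so that one length-2B DFT per block suffices, and it uses a plain (slow) DFT to stay self-contained. All names are illustrative.

```python
import cmath
import math

def dft(x, inverse=False):
    """Plain O(N^2) DFT; the inverse includes the 1/N factor."""
    N = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[m] * cmath.exp(sign * 2j * math.pi * k * m / N) for m in range(N))
           for k in range(N)]
    return [v / N for v in out] if inverse else out

def overlap_add(u, g, B):
    """Convolve real sequences u and g block by block with length-2B DFTs."""
    N = 2 * B
    if len(g) > B + 1:
        raise ValueError("this sketch needs len(g) <= B + 1")
    G = dft(list(g) + [0.0] * (N - len(g)))
    y = [0.0] * (len(u) + N)
    for start in range(0, len(u), B):
        # Non-overlapping input block, zero-padded to length 2B.
        block = list(u[start:start + B])
        block += [0.0] * (N - len(block))
        segment = dft([a * b for a, b in zip(dft(block), G)], inverse=True)
        # The tails of successive output segments overlap and are added.
        for m in range(N):
            y[start + m] += segment[m].real
    return y[:len(u) + len(g) - 1]

def direct_convolution(u, g):
    """Reference linear convolution."""
    y = [0.0] * (len(u) + len(g) - 1)
    for i, a in enumerate(u):
        for m, b in enumerate(g):
            y[i + m] += a * b
    return y

u = [1.0, -2.0, 3.0, 0.5, 4.0, -1.0, 2.0, 0.0, -3.0, 1.5]
g = [0.5, 1.0, -0.25]
y_fast = overlap_add(u, g, B=4)
y_ref = direct_convolution(u, g)
max_err = max(abs(a - b) for a, b in zip(y_fast, y_ref))
print(f"largest deviation from direct convolution: {max_err:.2e}")
```

The input blocks here do not overlap; it is the output segments that overlap and are summed. In Alg. 28.4 the same idea applies, with the previous transformed block carried forward through the term $J u'_{2B,n-1}$.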
We can now proceed as in Sec. 28.1 and derive adaptive implementations as well. Since the
arguments are similar to what we did before, we shall be brief and only highlight the differences.
Introduce the 1 x M/B regression vector
\[
s'_{k,n} = \begin{bmatrix} s'_k(n) & s'_k(n-1) & \ldots & s'_k(n - M/B + 1) \end{bmatrix}
\]
where the $\{s'_k(n)\}$ denote the inputs to the subband filters, as defined by (28.26). Collect further the
regressors $\{s'_{k,n}\}$ into a block diagonal matrix $S'_n$:
\[
S'_n \;\triangleq\; \mathrm{diag}\{s'_{0,n},\, s'_{1,n},\, \ldots,\, s'_{2B-1,n}\}, \qquad (2B \times 2M) \qquad (28.27)
\]


Then the same argument that led to (28.3) gives 
dan = [ Ode: Íp ]rs dS. (28.28) 


Comparing with (28.3) we see that the main distinction is in the use of $S'_n$ in place of $U'_n$. In addition,
$[0_{B\times B}\ \ I_B]$ replaces $[I_B\ \ 0_{B\times B}]$. Therefore, the results of Sec. 28.1 and also App. 28.A
will extend with minor modifications. In particular, the unconstrained filter of Alg. 28.1 becomes
Alg. 28.5. Figure 28.13 shows a block diagram representation of Alg. 28.5. In a similar manner,
the constrained filters of Algs. 28.2 and 28.3 become Algs. 28.6 and 28.7, respectively, which are
illustrated in Figs. 28.14 and 28.15.



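Algorithm 28.5 (Unconstrained overlap-add DFT block filter) Refer to the setting
of Alg. 28.1 and let J = diag(1, -1, 1, -1, ..., 1, -1) be a 2B x 2B matrix
with alternating ±1's. Repeat for $n \geq 0$:
\[
\begin{aligned}
u_{B,n} &= \mathrm{col}\{u(nB+B-1), \ldots, u(nB)\} \\
u'_{2B,n} &= F \begin{bmatrix} 0_{B\times 1} \\ u_{B,n} \end{bmatrix} \\
s'_{2B,n} &= u'_{2B,n} + J u'_{2B,n-1} \;\triangleq\; \mathrm{col}\{s'_k(n),\ k = 0, 1, \ldots, 2B-1\} \\
\lambda_k(n) &= \beta \lambda_k(n-1) + (1-\beta)\,|s'_k(n)|^2, \qquad k = 0, 1, \ldots, 2B-1 \\
s'_{k,n} &= \begin{bmatrix} s'_k(n) & s'_k(n-1) & \ldots & s'_k(n - M/B + 1) \end{bmatrix}, \qquad k = 0, 1, \ldots, 2B-1 \\
y'_k(n) &= s'_{k,n}\, l_{k,n-1}, \qquad k = 0, 1, \ldots, 2B-1 \\
\hat{d}_{B,n} &= \begin{bmatrix} 0_{B\times B} & I_B \end{bmatrix} F^*\, \mathrm{col}\{y'_0(n), \ldots, y'_{2B-1}(n)\} \\
e_{B,n} &= d_{B,n} - \hat{d}_{B,n} \\
e'_{2B,n} &= F \begin{bmatrix} 0_{B\times B} \\ I_B \end{bmatrix} e_{B,n} \;\triangleq\; \mathrm{col}\{e'_k(n),\ k = 0, 1, \ldots, 2B-1\} \\
l_{k,n} &= l_{k,n-1} + \frac{\mu}{\lambda_k(n)}\, s'^{*}_{k,n}\, e'_k(n), \qquad k = 0, 1, \ldots, 2B-1
\end{aligned}
\]
At each block n, the entries of $\{d_{B,n}, e_{B,n}\}$ correspond to
\[
\begin{aligned}
d_{B,n} &= \mathrm{col}\{d(nB+B-1), \ldots, d(nB+1), d(nB)\} \\
e_{B,n} &= \mathrm{col}\{e(nB+B-1), \ldots, e(nB+1), e(nB)\}
\end{aligned}
\]
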


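Algorithm 28.6 (Constrained overlap-add DFT block filter I) Consider the same
setting of Alg. 28.2, and repeat for $n \geq 0$:
\[
\begin{aligned}
u_{B,n} &= \mathrm{col}\{u(nB+B-1), \ldots, u(nB+1), u(nB)\} \\
u'_{2B,n} &= F \begin{bmatrix} 0_{B\times 1} \\ u_{B,n} \end{bmatrix} \\
s'_{2B,n} &= u'_{2B,n} + J u'_{2B,n-1} \;\triangleq\; \mathrm{col}\{s'_k(n),\ k = 0, 1, \ldots, 2B-1\} \\
\lambda_k(n) &= \beta \lambda_k(n-1) + (1-\beta)\,|s'_k(n)|^2, \qquad k = 0, 1, \ldots, 2B-1 \\
s'_{k,n} &= \begin{bmatrix} s'_k(n) & s'_k(n-1) & \ldots & s'_k(n - M/B + 1) \end{bmatrix}, \qquad k = 0, 1, \ldots, 2B-1 \\
y'_k(n) &= s'_{k,n}\, l_{k,n-1}, \qquad k = 0, 1, \ldots, 2B-1 \\
\hat{d}_{B,n} &= \begin{bmatrix} 0_{B\times B} & I_B \end{bmatrix} F^*\, \mathrm{col}\{y'_0(n), \ldots, y'_{2B-1}(n)\} \\
e_{B,n} &= d_{B,n} - \hat{d}_{B,n} \\
e'_{2B,n} &= F \begin{bmatrix} 0_{B\times B} \\ I_B \end{bmatrix} e_{B,n} \;\triangleq\; \mathrm{col}\{e'_k(n),\ k = 0, 1, \ldots, 2B-1\} \\
\bar{l}_{k,n} &= l_{k,n-1} + \frac{\mu}{\lambda_k(n)}\, s'^{*}_{k,n}\, e'_k(n), \qquad k = 0, 1, \ldots, 2B-1 \\
\begin{bmatrix} l_{0,n}^T \\ \vdots \\ l_{2B-1,n}^T \end{bmatrix} &= \frac{1}{2B}\, F^* \begin{bmatrix} I_B & 0_{B\times B} \\ 0_{B\times B} & 0_{B\times B} \end{bmatrix} F \begin{bmatrix} \bar{l}_{0,n}^T \\ \vdots \\ \bar{l}_{2B-1,n}^T \end{bmatrix}
\end{aligned}
\]
At each block n, the entries of $\{d_{B,n}, e_{B,n}\}$ correspond to
\[
\begin{aligned}
d_{B,n} &= \mathrm{col}\{d(nB+B-1), \ldots, d(nB+1), d(nB)\} \\
e_{B,n} &= \mathrm{col}\{e(nB+B-1), \ldots, e(nB+1), e(nB)\}
\end{aligned}
\]
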
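Algorithm 28.7 (Constrained overlap-add DFT block filter II) Consider again
the setting of Alg. 28.3, and repeat for $n \geq 0$:

Filtering in fullband. For j = 0 : B-1, generate the entries of the vector $e_{B,n}$ by
computing:
\[
\begin{aligned}
u_{nB+j} &= \begin{bmatrix} u(nB+j) & u(nB+j-1) & \ldots & u(nB+j-M+1) \end{bmatrix} \\
\hat{d}(nB+j) &= u_{nB+j}\, w_{n-1} \\
e(nB+j) &= d(nB+j) - \hat{d}(nB+j)
\end{aligned}
\]

Block input and error signals. Construct the block vectors
\[
\begin{aligned}
u_{B,n} &= \mathrm{col}\{u(nB+B-1), \ldots, u(nB)\} \\
u_{2B,n} &= \mathrm{col}\{0_{B\times 1},\, u_{B,n}\} \\
e_{B,n} &= \mathrm{col}\{e(nB+B-1), \ldots, e(nB)\}
\end{aligned}
\]

Frequency transformations. Perform the DFT transformations:
\[
\begin{aligned}
u'_{2B,n} &= F u_{2B,n} \\
s'_{2B,n} &= u'_{2B,n} + J u'_{2B,n-1} \\
e'_{2B,n} &= F \begin{bmatrix} 0_{B\times B} \\ I_B \end{bmatrix} e_{B,n}
\end{aligned}
\]
and let $\{e'_k(n), s'_k(n)\}$ denote the k-th entries of $\{e'_{2B,n}, s'_{2B,n}\}$. Let also
\[
s'_{k,n} = \begin{bmatrix} s'_k(n) & s'_k(n-1) & \ldots & s'_k(n - M/B + 1) \end{bmatrix}
\]

Adaptation. For each subband filter k = 0, ..., 2B-1, adapt its coefficients:
\[
\begin{aligned}
\lambda_k(n) &= \beta \lambda_k(n-1) + (1-\beta)\,|s'_k(n)|^2 \\
l_{k,n} &= l_{k,n-1} + \frac{\mu}{\lambda_k(n)}\, s'^{*}_{k,n}\, e'_k(n)
\end{aligned}
\]

Subband/fullband mapping. Set
\[
w_n = \mathrm{vec}\left\{ \begin{bmatrix} I_B & 0_{B\times B} \end{bmatrix} F \begin{bmatrix} l_{0,n}^T \\ \vdots \\ l_{2B-1,n}^T \end{bmatrix} \right\}
\]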
FIGURE 28.13 Block diagram representation of the unconstrained overlap-add DFT block
adaptive filter of Alg. 28.5.

FIGURE 28.14 Block diagram representation of the constrained overlap-add DFT block
adaptive filter of Alg. 28.6.

FIGURE 28.15 The constrained overlap-add DFT block adaptive filter of Alg. 28.7 with
frequency-domain structures for both adaptation and convolution.


Summary and Notes 


The chapters in this part describe several adaptive structures that are meant to reduce the 
computational requirements of LMS and improve its convergence speed. 


SUMMARY OF MAIN RESULTS 


1. The text describes three classes of efficient adaptive structures: transform-domain filters, 
block adaptive filters, and subband adaptive filters. 


2. Transform-domain filters exploit the de-correlation properties of some unitary signal trans- 
forms, such as the discrete Fourier transform (DFT) and the discrete cosine transform (DCT), 
in order to pre-whiten the input data and speed up filter convergence. Designing a transform- 
domain filter usually involves three steps: 


(i) Replacing the regression vector $u_i$ by a transformed version $\bar{u}_i = u_i T$, where T is a
unitary transformation.


(ii) Employing a diagonal step-size matrix that consists of estimates of the inverse powers
in the individual entries of $\bar{u}_i$.


(iii) Making sure that the elements of the transformed regressor can be evaluated efficiently 
in O(M) operations. 


Steps (i)—(iii) only help improve the convergence speed of LMS; they do not result in a reduc- 
tion of computational complexity. Two types of transformed filters were discussed in the text: 
DFT-LMS and DCT-LMS. 


3. Block adaptive filters reduce the computational cost of LMS by some factor $\alpha > 1$, while at
the same time improving the convergence speed. This is achieved by processing the data on 
a block-by-block basis and by exploiting the fact that many signal transforms admit efficient 
implementations. Designing a block adaptive filter generally involves three steps: 


(i) Choosing a signal transform, such as the DFT, and choosing a block size and an order 
for the subband filters. The number of subband filters is related to the block size and to 
the type of signal transform used. 


(ii) Collecting blocks of input data and transforming them by the chosen signal transform. 
The manner by which the data blocks are processed depends on the transform. The 
transformed data then play the role of regression data to the subband filters. 

(iii) There are mainly two ways to obtain the error signals for training the subband filters: 
(iii.1) In one implementation, the estimates $\{\hat{d}(i)\}$ are generated by signal transforming
the outputs of the subband filters.
(iii.2) In another implementation, the estimates $\{\hat{d}(i)\}$ are generated through fullband
convolution.
The estimates $\{\hat{d}(i)\}$ are subtracted from the original reference sequence $\{d(i)\}$ in
order to generate the time-domain error sequence $\{e(i)\}$, which is then transformed to
the frequency domain and used to train the subband filters.

4. Two kinds of block adaptive implementations are described: unconstrained and constrained,
with the latter having superior performance. In addition, for the constrained case, two
implementations are possible with $\{\hat{d}(i)\}$ generated either through convolution in fullband or
through transforming the outputs of the subband filters.


5. The derivation of block adaptive filters is carried out by using a convenient embedding, 
whereby a certain transfer matrix function is embedded into a larger matrix with special struc- 
ture. The structure is such that it can be diagonalized by the DFT. Other embeddings are 
possible that result in structures that are diagonalizable by other common transforms such 
as the DCT and DHT. The DFT-based block adaptive filters are derived in the text and in 
Apps. 28.A and 28.B. On the other hand, DCT and DHT-based structures can be found in 
Apps. 10.D and 10.E of Sayed (2003). 


6. Subband adaptive filters are related to the DFT-based block adaptive filters, except that they 
attempt to achieve better band partitioning of the data via selection of more suitable prototype 
filters for the analysis and synthesis banks. There are two basic configurations for subband 
filtering: open-loop and closed-loop. In closed-loop, the error sequence is generated in the 
fullband time-domain and then processed by the analysis filter bank. In open-loop, subband 
error signals are used to train the subband filters. 


BIBLIOGRAPHIC NOTES 


Transform-domain filters. The idea of a transform-domain adaptive filter was first proposed by 
Narayan and Peterson (1981) and Narayan, Peterson and Narasimha (1983). Another early work is 
by Marshall, Jenkins, and Murphy (1989), which provides a geometric interpretation of the effect of 
step-normalization on the performance of transform-domain LMS. The DCT-domain LMS filter of 
Sec. 26.3 was proposed by Beaufays and Widrow (1994) and Beaufays (1995); the form we presented 
in Alg. 26.3 is slightly different. 


Terminology. The block filters we derived in Sec. 28.1 by using the DFT transform are generally 
referred to in the literature as “block adaptive filters" or “block LMS filters". We have opted to use 
a more specific terminology and attach the "DFT" qualification to the name. Thus we refer to them 
as "DFT-based block adaptive filters", *DFT block adaptive filters" or simply "DFT block LMS 
filters". We do so in order to distinguish these forms from other block adaptive filters proposed by 
Merched and Sayed (2000a), which are described in Apps. 10.D and 10.E of Sayed (2003). These
alternative forms are based on other signal transforms, such as the discrete-cosine and discrete Hartley
transforms. Thus one can refer to DFT-, DCT-, and DHT-block adaptive filters. 


Block adaptive filters. In our treatment, we study directly block adaptive filters with multiple taps 
in the subbands (i.e., multi-delay filters). This is not how these filters were historically developed. 
The earliest block filters were derived in the early 1980s with a single tap in the subbands by Ferrara 
(1980) and Clark, Mitra, and Parker (1981). The multi-delay filters appeared later and were derived 
independently by Asharif et al. (1986a,1986b), Soo and Pang (1987,1990), and Sommen (1989), 
under different names (specifically, frequency bin adaptive filter, multi-delay frequency block LMS, 
and partitioned frequency block LMS, respectively). Other early works on block filters include those 
by Xu and Grenier (1989), Petraglia and Mitra (1991,1993), Shynk (1992), and Borallo and Otero 
(1992). All these works were specific to the DFT domain. 


Embeddings. The presentation in Chapters 27 and 28 follows Merched and Sayed (2000a), where
DCT- and DHT-block adaptive filters are also derived. The arguments in these chapters, even for 
the DFT case, are different from the original derivations of multi-delay filters. Specifically, the 
derivation in Sec. 28.1 relies on embedding certain transfer matrices into larger structured block 
matrices that can be diagonalized by the DFT matrix and also by other trigonometric transforms 
(using, e.g., results from Bini and Favati (1993) and Heinig and Rost (1998) on structured matrix 
computations — see also Kailath and Sayed (1995)). One useful consequence of the embedding 


approach is that it can handle extensions of the block adaptive structure to other transform domains 
such as the DCT and the DHT (see Apps. 10.D and 10.E of Sayed (2003)). Similar embedding 
techniques in the DFT domain were used before by Lin and Mitra (1996) in the design of efficient 
block implementations for digital filters. It should be noted that the DCT- and DHT-based schemes 
require only real arithmetic and, since efficient algorithms exist for computing the DCT and the DHT 
(see, e.g., Vetterli and Nussbaumer (1984) and Ersoy (1997)), these schemes also lead to efficient 
adaptive structures. 


Power normalization. The use of NLMS, or of a power-normalized version of it as in Sec. 28.1, 
in order to train the subband filters, can improve the rate of convergence of block filters as indicated 
by the analyses in Sommen et al. (1987) and Lee and Un (1989). 


Constrained vs. unconstrained implementations. In Mansour and Gray (1982), it was shown
that the unconstrained DFT-based block adaptive filter can still work well if the filter length M is
sufficiently large. Still, unconstrained implementations are slow to converge (especially for M/B > 1)
and they do not converge to the optimal solution. On the other hand, it has been verified
experimentally that the constraint does not need to be applied at every iteration.


Subband adaptive filters. The concept of subband adaptive filtering was introduced by Furukawa 
(1984) and Kellermann (1984,1985). Other early works in this area include Yasukawa and Shimada 
(1987,1993), Chen et al. (1988), Gilloire and Vetterli (1988,1992), and Petraglia and Mitra (1993). 
Some further results on subband structures can be found, e.g., in Morgan and Thi (1995), Pradhan 
and Reddy (1999), Farhang-Boroujeny (1999), Petraglia, Alves, and Diniz (2000), and the references 
therein. The work by Morgan and Thi (1995) provides a useful approach to handling the delay 
problem that plagues subband architectures (see Prob. VI.14). The book by Farhang-Boroujeny
(1999) provides a good treatment of subband adaptive filters with emphasis on design issues. 


Block and subband filters. When the open- and closed-loop subband architectures of Sec. 28.2 
were proposed in the literature, the DFT-block adaptive filters of Sec. 28.1 did not even exist. So 
the derivations of these subband structures were not originally motivated in the same manner we 
discussed in Sec. 28.2, by showing how block and subband adaptive filters relate to each other. In 
fact, one can verify that even the technical language used in the derivation of the DFT-block adaptive 
structure, e.g., in Soo and Pang (1987,1990), has little resemblance with the language of modern 
multi-rate systems and filter banks (e.g., as in Vaidyanathan (1993)). 


An application to acoustic echo cancellation. As mentioned in the concluding remarks of 
Part IV (Mean-Square Performance), there are two types of echoes in communications systems: line 
echoes and acoustic echoes. In Computer Project IV.1 we studied line echo cancellation and in Com- 
puter Project VI.1 we shall study acoustic echo cancellation using block adaptive filters. Acoustic 
echoes occur in hands-free telephony and teleconferencing and they result from the reflection of 
sound waves inside an enclosure, and from the acoustic coupling between the microphone and the 
loudspeaker (see, e.g., Benesty et al. (2001)). Compared with line echo problems, the duration of the 
echo path tends to be longer for acoustic echoes. This is one reason why block and subband adaptive 
filters are popular choices for acoustic echo cancellation applications. 
Problems and Computer Projects


PROBLEMS 


Problem VI.1 (Similarity transformations) Let $A$ and $B$ be two similar matrices, i.e.,
$B = T^{-1}AT$ for some square invertible matrix $T$. Show that $A$ and $B$ have the same eigenvalues.


Problem VI.2 (Nonnegativity of power spectra) Let $r(k) = E\,u(i)u^*(i-k)$ and assume
this auto-correlation sequence is exponentially bounded as in (26.2). Introduce the corresponding
power spectrum (26.3). Let $R_M$ be the Hermitian Toeplitz covariance matrix whose first column is
$\mathrm{col}\{r(0),\ldots,r(M-1)\}$. Pick any finite scalar $a$ and define
$b = \mathrm{col}\{a, ae^{j\omega}, \ldots, ae^{j(M-1)\omega}\}$, where $j = \sqrt{-1}$. Show that

$$0 \;\le\; \frac{1}{M}\, b^* R_M b \;=\; |a|^2 \sum_{k=-(M-1)}^{M-1} r(k)e^{-j\omega k} \;-\; \frac{1}{M}\sum_{k=-(M-1)}^{M-1} |k|\,|a|^2\, r(k)e^{-j\omega k}$$

Now take the limit as $M \to \infty$ to conclude that $S_u(e^{j\omega}) \ge 0$.


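Problem VI.3 (Diagonal weighting matrix) Consider the stochastic version of the general
transform domain LMS recursion stated in Alg. 26.1, namely,

$$\begin{aligned}
\bar{u}_i &= u_i T, \qquad \bar{u}_i(k) = k\text{-th entry of } \bar{u}_i \\
\lambda_k(i) &= \beta\lambda_k(i-1) + (1-\beta)|\bar{u}_i(k)|^2, \quad k = 0,1,\ldots,M-1, \qquad D_i = \mathrm{diag}\{\lambda_k(i)\} \\
\bar{w}_i &= \bar{w}_{i-1} + \mu D_i^{-1}\bar{u}_i^*\,[d(i) - \bar{u}_i\bar{w}_{i-1}]
\end{aligned}$$

Let $[R_{\bar u}]_{kk}$ denote the $k$-th diagonal entry of the covariance matrix $R_{\bar u} = E\,\bar{u}_i^*\bar{u}_i$. Argue that

$$E\,D_i \;\to\; \mathrm{diag}\{[R_{\bar u}]_{00}, [R_{\bar u}]_{11}, \ldots, [R_{\bar u}]_{M-1,M-1}\} \;\triangleq\; \mathrm{diagonal}\{R_{\bar u}\}, \quad \text{as } i \to \infty$$

In other words, $E\,D_i$ tends to a diagonal matrix that agrees with the diagonal of $R_{\bar u}$.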


Problem VI.4 (Mean-square performance) Consider the same setting of Prob. VI.3 and
assume the data $\{d(i), u_i\}$ satisfy model (15.16). Follow the arguments used in Sec. 19.1 while
evaluating the performance of RLS to justify the following approximation for the excess mean-square
error of the general transform-domain LMS algorithm: $\zeta^{\text{transform LMS}} = \mu\sigma_v^2 M/(2 - \mu M)$.


Problem VI.5 (Tracking performance) Consider the same setting of Prob. VI.3 and assume the
data $\{d(i), u_i\}$ satisfy the nonstationary model (20.16) with a sufficiently small degree of
nonstationarity. Follow the arguments used in Sec. 21.4 while evaluating the performance of RLS to justify
the following approximation for the excess mean-square error of the general transform-domain LMS
algorithm:

$$\zeta^{\text{transform LMS}} = \frac{\mu\sigma_v^2 M + \mu^{-1}\,\mathrm{Tr}\!\left(\bar{Q}\cdot\mathrm{diagonal}\{R_{\bar u}\}\right)}{2 - \mu M}$$

where $\bar{Q} = T^*QT$.
Problem VI.6 (Circulant matrices) Consider a $4 \times 4$ circulant matrix $C$ and the $4 \times 4$ DFT
matrix $F$, namely,

$$C = \begin{bmatrix} c_0 & c_1 & c_2 & c_3 \\ c_3 & c_0 & c_1 & c_2 \\ c_2 & c_3 & c_0 & c_1 \\ c_1 & c_2 & c_3 & c_0 \end{bmatrix}, \qquad [F]_{mk} = e^{-\frac{j2\pi mk}{4}}, \quad m,k = 0,1,2,3$$


(a) Show that the columns of F* are eigenvectors of C. What are the corresponding eigenvalues? 
Conclude that C is diagonalized by the DFT matrix. 


(b) Replace C by a circulant matrix function, C(z). Use the result of part (a) to argue that C(z) is 
also diagonalized by the DFT matrix. 


Problem VI.7 (Diagonalization of circulant matrices) Let $C = F^*DF$ denote the
diagonalization of an $M \times M$ circulant matrix $C$ by means of the DFT matrix $F$. Collect the entries of the
diagonal matrix $D$ into a column vector $d = \mathrm{diag}\{D\}$. Verify that $d$ is proportional to the DFT of
the first column of $C$, specifically, $d = FCe_1/M$, where $e_1$ is the first basis vector.
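The claims of Probs. VI.6 and VI.7 can be sanity-checked numerically before attempting a proof. The following is a minimal sketch in plain Python, assuming an unnormalized DFT matrix with entries $e^{-j2\pi mk/M}$ and a circulant matrix whose rows are right-shifts of the first row, as in Prob. VI.6. The helper names and the test vector `c` are illustrative choices, not part of the text.

```python
import cmath

def dft_matrix(M):
    """Unnormalized DFT matrix with entries exp(-j*2*pi*m*k/M)."""
    return [[cmath.exp(-2j * cmath.pi * m * k / M) for k in range(M)]
            for m in range(M)]

def circulant(first_row):
    """Circulant matrix whose rows are right-shifts of first_row."""
    M = len(first_row)
    return [[first_row[(n - m) % M] for n in range(M)] for m in range(M)]

def matmul(A, B):
    """Plain matrix product of two lists of lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def conj_transpose(A):
    """Complex-conjugate transpose."""
    return [[A[j][i].conjugate() for j in range(len(A))]
            for i in range(len(A[0]))]

c = [1.0, 2.0, -0.5, 3.0]            # arbitrary illustrative first row
M = len(c)
C = circulant(c)
F = dft_matrix(M)
Fh = conj_transpose(F)

# d = F C e_1 / M: the scaled DFT of the first column of C
first_col = [C[m][0] for m in range(M)]
d = [sum(F[k][m] * first_col[m] for m in range(M)) / M for k in range(M)]

# Rebuild C from F^* diag(d) F and record the worst entry-wise error
D = [[d[k] if k == n else 0.0 for n in range(M)] for k in range(M)]
C_rebuilt = matmul(matmul(Fh, D), F)
max_err = max(abs(C_rebuilt[m][n] - C[m][n])
              for m in range(M) for n in range(M))
print(max_err)                        # close to zero, up to rounding
```

A check of this kind does not replace the algebraic argument, but it quickly exposes sign or scaling mistakes such as a missing factor of $1/M$.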


Problem VI.8 (Order of subband filters) Verify that the DFT matrix (27.17) satisfies
$FF^* = 2B \cdot I_{2B}$, and use (27.18) to write $\mathcal{L}(z) = \frac{1}{4B^2}FC(z)F^*$. Conclude that each $L_k(z)$ is a linear
combination of the polyphase components $\{P_k(z)\}$. Conclude further that each $L_k(z)$ is a polynomial
in $z^{-1}$ with $M/B$ coefficients.


Problem VI.9 (Constraint condition) Use $F^{\mathsf T} = F$ and (27.18) to write $C^{\mathsf T}(z) = F\mathcal{L}(z)F^*$.
Multiply this equality by $\mathrm{col}\{1,0,\ldots,0\}$ from both sides to conclude that (27.20) holds.


Problem VI.10 (Relating fullband and subband filters) Refer to the discussion in Sec. 27.3
and to the relation between the polyphase components $\{P_k(z)\}$ and the subband filters $\{L_k(z)\}$.
Assume $B = M$ so that the filters $\{P_k(z), L_k(z)\}$ are all constants, namely, $P_k(z) = g(k)$ and
$L_k(z) = l(k)$. Here, the $\{g(k), k = 0,1,\ldots,M-1\}$ denote the impulse response coefficients
of the wideband filter $G(z)$ in (27.2). Moreover, the $\{l(k), k = 0,1,\ldots,2M-1\}$ denote the
coefficients of the subband filters $\{L_k(z)\}$. Follow the argument that led to (27.20) to show that the
fullband coefficients $\{g(k)\}$ and the subband coefficients $\{l(k)\}$ satisfy the following relations:

$$\begin{bmatrix} l(0) \\ l(1) \\ \vdots \\ l(2M-1) \end{bmatrix} = \frac{1}{2M}F^*\begin{bmatrix} g(0) \\ \vdots \\ g(M-1) \\ 0_{M\times 1} \end{bmatrix} = \frac{1}{2M}F\begin{bmatrix} g(0) \\ 0_{M\times 1} \\ g(M-1) \\ \vdots \\ g(1) \end{bmatrix}$$


Problem VI.11 (Property of the DFT) Use expression (27.17) to establish (28.24). 


Problem VI.12 (Block LMS) Recall from the discussions in Sec. 27.1 that our motivation is to 
estimate the weight vector g in (27.1) in an efficient manner. The block recursions (28.6)-(28.8) 
provide one efficient DFT-based solution. Assume B = M, in which case the subband filters 
$\{l_{k,n-1}\}$ in Fig. 28.1 have a single coefficient each. Let $w_n$ denote the $M \times 1$ estimate of the weight
vector $g$ at block iteration $n$. In view of the result of Prob. VI.10, the entries of the $M \times 1$
time-domain estimate $w_n$ and the entries of the $2M \times 1$ frequency-domain estimate $W_n$ can be assumed
to be related to each other as follows:

$$\begin{bmatrix} W_n(0) \\ W_n(1) \\ \vdots \\ W_n(2M-1) \end{bmatrix} = \frac{1}{2M}F^*\begin{bmatrix} w_n(0) \\ \vdots \\ w_n(M-1) \\ 0_{M\times 1} \end{bmatrix} = \frac{1}{2M}F\begin{bmatrix} w_n(0) \\ 0_{M\times 1} \\ w_n(M-1) \\ \vdots \\ w_n(1) \end{bmatrix}$$
Use this relation to establish that the DFT-based block recursions (28.6)-(28.8) are equivalent to the
following (time-domain) block LMS recursion:

$$\begin{aligned}
w_n &= w_{n-1} + \mu\sum_{i=0}^{M-1} u_{nM+i}^*\, e(nM+i) \\
e(nM+i) &= d(nM+i) - u_{nM+i}w_{n-1} \\
u_{nM+i} &= \begin{bmatrix} u(nM+i) & u(nM+i-1) & \ldots & u(nM+i-M+1) \end{bmatrix} \quad (1 \times M)
\end{aligned}$$

Remark. Typically, in a block LMS implementation, the gradient vector used to update the block weight estimate
$w_{n-1}$ to $w_n$ is an average over several instantaneous approximations, say

$$w_n = w_{n-1} + \mu\sum_{i=0}^{B-1} u_{nB+i}^*\, e(nB+i)$$

for some $B \le M$, where $B$ is referred to as the block length and $u_{nB+i}$ is still $1 \times M$. Note that all error
signals $\{e(nB+i)\}$ over a block are computed using the same weight estimate $w_{n-1}$. Moreover, while this
problem considered the special case $B = M$, the discussion in Chapter 28 studies DFT-based algorithms for the
more general case $B \le M$ when the subband filters $\{l_{k,n-1}\}$ in Fig. 28.1 have $M/B$ coefficients each.
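The time-domain block LMS recursion in the remark above can be sketched directly. The following is a minimal illustration in plain Python for real-valued data (so the conjugate reduces to a transpose). The function name `block_lms`, the test filter `g`, the step-size, and the noiseless white-noise setup are assumptions made for the example, not values taken from the text.

```python
import random

def block_lms(u, d, M, B, mu):
    """Time-domain block LMS: one weight update per block of B samples.

    u, d : input and reference sequences (real-valued lists)
    M    : number of taps; B : block length (B <= M); mu : step-size
    """
    w = [0.0] * M                                   # w_{-1} = 0
    for n in range(len(u) // B):
        grad = [0.0] * M
        for i in range(B):
            t = n * B + i
            # regressor u_t = [u(t) u(t-1) ... u(t-M+1)], zero before time 0
            reg = [u[t - k] if t - k >= 0 else 0.0 for k in range(M)]
            # every error in the block uses the same estimate w_{n-1}
            e = d[t] - sum(r * wk for r, wk in zip(reg, w))
            for k in range(M):
                grad[k] += reg[k] * e
        w = [wk + mu * gk for wk, gk in zip(w, grad)]
    return w

random.seed(0)
g = [0.5, -0.3, 0.2, 0.1]                           # hypothetical true weights
N = 4000
u = [random.gauss(0.0, 1.0) for _ in range(N)]
d = [sum(g[k] * u[t - k] for k in range(len(g)) if t - k >= 0)
     for t in range(N)]

w = block_lms(u, d, M=4, B=2, mu=0.05)
mismatch = max(abs(a - b) for a, b in zip(w, g))
print(mismatch)                                     # small in this noiseless setup
```

This direct form costs on the order of $M$ operations per sample. The DFT-based recursions of Chapter 28 are meant to compute the same kind of update more efficiently, which this sketch does not attempt.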


Problem VI.13 (Mean-square performance of block LMS) Consider a block LMS stochastic
recursion of the form

$$w_n = w_{n-1} + \mu\sum_{i=0}^{B-1} u_{nB+i}^*\,[d(nB+i) - u_{nB+i}w_{n-1}]$$

with block length $B \le M$, and where $w_n$ is $M \times 1$ and $u_{nB+i}$ is $1 \times M$. Moreover, $n$ is the
block index, so that each update is performed after the collection of $B$ data points. Assume the data
$\{d(i), u_i\}$ satisfy model (15.16).

(a) Verify that the filter recursion can be expressed in the form

$$w_n = w_{n-1} + \mu U_n^* e_{B,n}$$

where

$$U_n = \begin{bmatrix} u_{nB+B-1} \\ \vdots \\ u_{nB+1} \\ u_{nB} \end{bmatrix} \;(B \times M), \qquad e_{B,n} = \begin{bmatrix} e(nB+B-1) \\ \vdots \\ e(nB+1) \\ e(nB) \end{bmatrix} = d_{B,n} - U_n w_{n-1}$$

(b) Let $\tilde{w}_n = w^o - w_n$. Introduce the a posteriori and a priori error vectors $e_{p,n} = U_n\tilde{w}_n$ and
$e_{a,n} = U_n\tilde{w}_{n-1}$, respectively. Assuming $U_n$ is full rank, establish the energy conservation
relation:

$$\|\tilde{w}_n\|^2 + e_{a,n}^*(U_nU_n^*)^{-1}e_{a,n} = \|\tilde{w}_{n-1}\|^2 + e_{p,n}^*(U_nU_n^*)^{-1}e_{p,n}$$

(c) Argue that in steady-state, and under expectation, the relation of part (b) leads to

$$2E\,\|e_{a,n}\|^2 = \mu\sigma_v^2 B\,\mathrm{Tr}(R_u) + \mu\,\mathrm{Tr}\!\left[E\left(e_{a,n}e_{a,n}^*U_nU_n^*\right)\right], \quad n \to \infty$$

(d) Assume that, in steady-state, $E\,e_{a,n}e_{a,n}^* \approx (E|e_a(\infty)|^2)\cdot I$ for small $\mu$, and $U_n$ is
independent of $e_{a,n}$ (separation principle). Conclude from part (c) that the EMSE of the block LMS
filter is approximated by

$$\zeta^{\text{block LMS}} = \frac{\mu\sigma_v^2\,\mathrm{Tr}(R_u)}{2 - \mu\,\mathrm{Tr}(R_u)}$$


Problem VI.14 (Delayless implementation) There is an alternative implementation that
ameliorates the delay problem associated with the frequency-domain structure of Fig. 27.8. Consider the
structure shown in Fig. VI.1 (for the special case $B = 3$). The input sequence $\{u(i)\}$ is convolved
directly with the first $B-1$ coefficients of $G(z)$, i.e., with the transfer function
$g(0) + g(1)z^{-1} + \ldots + g(B-2)z^{-(B-2)}$, in order to generate an intermediate sequence $\{x(i)\}$. The remaining part of
$G(z)$, denoted by $G_B(z)$ and collecting the coefficients $\{g(B-1), \ldots, g(M-1)\}$,
is implemented block-wise as in Fig. 27.8 by using instead the subband filters that correspond to
$G_B(z)$; this structure is depicted in Fig. VI.1 in terms of serial-to-parallel and parallel-to-serial
conversion blocks (representing the banks of decimators and interpolators of Fig. 27.8). The
$B$-dimensional vector at the input of the parallel-to-serial converter is denoted by $z_i$. Assume $B = 3$.

FIGURE VI.1 A delayless alternative to the block implementation of Fig. 27.8. The S/P block employs $2B$
decimators at the downsampling rate of $B$ each, and the P/S block employs $B$ interpolators at the upsampling
rate of $B$ each. The block processing relies on $2B \times 2B$ DFT matrices $\{F, F^*\}$ and on $2B$ subband filters
corresponding to $G_B(z)$ and, therefore, of order $(M-B+1)/B$ each.


(a) Verify that, at any particular instant $i$,

$$x(i) = g(0)u(i) + g(1)u(i-1), \qquad z_i = \begin{bmatrix} g(2)u(i) + g(3)u(i-1) + \ldots \\ g(2)u(i-1) + g(3)u(i-2) + \ldots \\ g(2)u(i-2) + g(3)u(i-3) + \ldots \end{bmatrix}$$

(b) Adding the output of the direct convolution path to the output of the parallel-to-serial
converter, verify that

$$\begin{bmatrix} y(i+2) \\ y(i+1) \\ y(i) \end{bmatrix} = \begin{bmatrix} x(i+2) \\ x(i+1) \\ x(i) \end{bmatrix} + z_i = \begin{bmatrix} g(0)u(i+2) + g(1)u(i+1) + g(2)u(i) + g(3)u(i-1) + \ldots \\ g(0)u(i+1) + g(1)u(i) + g(2)u(i-1) + g(3)u(i-2) + \ldots \\ g(0)u(i) + g(1)u(i-1) + g(2)u(i-2) + g(3)u(i-3) + \ldots \end{bmatrix}$$


Remark. In other words, $y(i)$ is generated at the output of Fig. VI.1 in response to $u(i)$ and so forth. In this way,
the delay problem of Fig. 27.8 is removed. For further details on the results of this problem, see Morgan and Thi
(1995).




Problem VI.15 (Unconstrained block filter) Refer to recursion (28.12) for the estimation of 
the subband adaptive filters in an unconstrained implementation. Define 


Bn (usin) 
i ul" 
Wa " n ; Un = ( iim) 
B-1n (udo i4)" 


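as well as

$$\mathcal{E}_n = \mathrm{diag}\{e_0(n), e_1(n), \ldots, e_{2B-1}(n)\} \quad\text{and}\quad \Lambda_n = \mathrm{diag}\{\lambda_0(n), \lambda_1(n), \ldots, \lambda_{2B-1}(n)\}$$

where $\{W_n, U_n\}$ are $2B \times M/B$ and $\{\mathcal{E}_n, \Lambda_n\}$ are $2B \times 2B$. Verify that the $2B$ recursions (28.12)
can be grouped together into a single matrix recursion as follows: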


eno = dpn -|1s Onxa ]F'diag (Wass) 


€2B,n =F | Ip | €B,n 


0B8xB 
Wn = Wn-1 + UAn EnUn, 


W-ı=0 


where the notation diag{ A}, for a matrix A, is a vector with the diagonal elements of A. 


Problem VI.16 (Constrained block filter) Refer to definition (28.16) for the constrained weight- 
vector estimates, and define the weight matrix 


ist 
2B-1,n 


Use the result of Prob. VI.15 to show that the constrained estimates $\{l^c_{k,n}\}$ satisfy the matrix
recursion:
eB,n = de,n — | Ip Opxe | F*diag (Wi i42) 


| 


Clearly, the factor $1/2B$ can be incorporated into $\mu$. Verify further that $\{W_n, W_n^c\}$ are related via


| Is | FAZ EnUn, We =0 
0BxB 


Is FW, 
OBxB 


Problem VI.17 (Decimators and Interpolators) Consider the interpolator shown below:

$$x(n) \;\longrightarrow\; \boxed{\uparrow B} \;\longrightarrow\; y(i)$$

Show that the $z$-transforms of the sequences $\{x(n), y(i)\}$ are related via $Y(z) = X(z^B)$.
Conclude that $Y(e^{j\omega}) = X(e^{j\omega B})$ and that, therefore, the frequency spectrum of $\{y(i)\}$ is obtained
by compressing the frequency spectrum of $\{x(n)\}$ by a factor $B$ and repeating it periodically over
$[0, 2\pi]$. Conclude further that the spectrum of the interpolated signal $\{y(i)\}$ is periodic with period
$2\pi/B$.

Likewise, consider the decimator shown below:

$$x(i) \;\longrightarrow\; \boxed{\downarrow B} \;\longrightarrow\; y(n)$$

Show that the $z$-transforms of the sequences $\{x(i), y(n)\}$ are now related as follows:

$$Y(z) = \frac{1}{B}\sum_{k=0}^{B-1} X\!\left(z^{1/B}e^{-j2\pi k/B}\right)$$

so that

$$Y(e^{j\omega}) = \frac{1}{B}\sum_{k=0}^{B-1} X\!\left(e^{j\frac{\omega - 2\pi k}{B}}\right)$$

In other words, conclude that the spectrum of the decimated signal $\{y(n)\}$ is obtained by expanding
the spectrum of $\{x(i)\}$ by a factor $B$ and scaling its amplitude by $1/B$. Conclude that in order to
avoid aliasing after decimation, the spectrum of the original signal $\{x(i)\}$ should be limited to the
interval $[-\pi/B, \pi/B]$. Figure VI.2 illustrates the spectra of the original, decimated, and interpolated
signals for the case $B = 2$.


FIGURE VI.2 Spectra of the original (top), decimated (middle), and interpolated (bottom) signals using $B = 2$.
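Both spectral relations of Prob. VI.17 hold exactly for finite-length sequences, so they can be checked numerically. The sketch below, in plain Python, evaluates the discrete-time Fourier transform by direct summation at a few frequencies. The test sequence `x`, the frequency grid `freqs`, and the helper names are arbitrary illustrative choices.

```python
import cmath

def dtft(x, w):
    """Evaluate X(e^{jw}) = sum_n x(n) e^{-jwn} for a finite sequence x."""
    return sum(xn * cmath.exp(-1j * w * n) for n, xn in enumerate(x))

def upsample(x, B):
    """Insert B-1 zeros after every sample (interpolator without filtering)."""
    y = []
    for xn in x:
        y.append(xn)
        y.extend([0.0] * (B - 1))
    return y

def downsample(x, B):
    """Keep every B-th sample (decimator without filtering)."""
    return x[::B]

B = 2
x = [1.0, -2.0, 0.5, 3.0, 1.5, -1.0, 0.25, 2.0]    # arbitrary test sequence
freqs = [0.3, 1.1, 2.5]

# Interpolator: Y(e^{jw}) = X(e^{jwB})
y_up = upsample(x, B)
err_up = max(abs(dtft(y_up, w) - dtft(x, w * B)) for w in freqs)

# Decimator: Y(e^{jw}) = (1/B) * sum_k X(e^{j(w - 2*pi*k)/B})
y_down = downsample(x, B)
err_down = max(
    abs(dtft(y_down, w)
        - sum(dtft(x, (w - 2 * cmath.pi * k) / B) for k in range(B)) / B)
    for w in freqs
)
print(err_up, err_down)                             # both near zero, up to rounding
```

The $B$ shifted copies in the decimator sum are the aliasing terms. They overlap unless the spectrum of the original signal is confined to $[-\pi/B, \pi/B]$.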


COMPUTER PROJECT 


Project VI.1 (Acoustic echo cancellation) Acoustic echo cancellation is a common problem
in hands-free telephony, where a person moves freely in a room while talking and listening to a
remote speaker. With a loudspeaker and a microphone installed in the same room, as indicated by
Fig. VI.3, undesired echoes over the walls interfere with the signal from the local speaker at the
microphone. The task of an acoustic echo canceller is to estimate the echo path, reconstruct the echoes,
and subtract them from the microphone signal, thus leaving only the signal by the local speaker.
Acoustic echo cancellation shares many commonalities with line echo cancellation, as studied in
Computer Project IV.1. One main difference is that the length of the acoustic echo path tends to be
considerably longer than the length of the line echo path. For this reason, block and subband adaptive
filters are useful for applications involving acoustic echo cancellation.

FIGURE VI.3 Adaptive acoustic echo reflections in a room containing a loudspeaker and a microphone.

(a) Load the file room.mat, which contains 1024 samples of a measured impulse response
sequence of an echo path in a room. Plot the impulse and frequency responses of the echo
path.

(b) Load the file css.mat, which contains 5600 samples of a composite source signal. As
explained in Computer Project IV.1, this is a synthetic signal that emulates the properties of
speech. Concatenate 20 such blocks to form a loudspeaker signal and feed it into the echo
path. Train an acoustic echo canceller with 1024 taps using $\epsilon$-NLMS with $\epsilon = 10^{-6}$ and
step-sizes $\mu \in \{0.1, 0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 1.9\}$. Use the echo as a reference
sequence and the loudspeaker signal as the input to the adaptive filter. Plot the loudspeaker
signal, the echo at the microphone, and the error signals that remain after adaptation for both
$\mu = 0.5$ and $\mu = 0.1$. For each step-size, compute the relative filter mismatch defined as
$S = \|ww - room\|^2/\|room\|^2$, where $ww$ is the filter estimate at the end of adaptation. Plot
$S$ as a function of the step-size. Which step-size results in the smallest mismatch?

(c) Limit the length of the loudspeaker signal to $N = 65536$ and choose as block length $B = 32$.
Train the constrained DFT-block adaptive filter of Alg. 28.7 with 64 subband filters using
both NLMS with power normalization as in (28.12) and standard NLMS as in (28.13). In the
latter case set the step-size at $\mu = 0.1$ and in the former case set the step-size at $\mu = 0.1/32$,
where 32 is the length of each subband filter. Plot the loudspeaker signal, the resulting echo
at the microphone, and the error signals that remain after adaptation by both algorithms. Plot
also the resulting mismatches.

(d) Load the file speech.mat and treat it as the signal from the loudspeaker. Adjust its length to
a multiple of the block length $B = 32$ and repeat it 20 times for a longer simulation period.
Train the acoustic echo canceller using the constrained DFT-block adaptive implementation
of Alg. 28.7 with $\mu = 0.03/32$ and listen to the loudspeaker, echo, and error (clean) signals.
Also plot the signals.

(e) Using CSS data again, generate a learning curve for $\epsilon$-NLMS using $\mu = 0.5$ and another
one for the constrained DFT-block adaptive implementation of Alg. 28.7 using $B = 32$
and $\mu = 0.1/32$. Use as loudspeaker data a first-order auto-regressive process generated
by filtering white Gaussian noise through $1/(1 - 0.95z^{-1})$. Normalize the variance of the
loudspeaker signal to unity. Generate the curves by averaging over 50 experiments and by
smoothing the resulting ensemble-average curves using a sliding window of width 10. Set the
noise power at 30 dB below the input signal power.
Least-Squares Criterion


The earlier parts of this book dealt extensively with the problem of linear least-mean-squares
estimation, whereby one random variable is estimated from observations of another
correlated random variable. For example, in Sec. 8.1 we studied the problem of
estimating a zero-mean random variable $d$ from a zero-mean random row vector $u$, by
seeking the optimal column vector $w$ that solves

$$\min_w \; E\,|d - uw|^2 \tag{29.1}$$

The optimal estimator was found to be

$$\hat{d} = uw^o \quad\text{where}\quad w^o = R_u^{-1}R_{du} \tag{29.2}$$

in terms of the second-order moments of $\{d, u\}$, namely,

$$R_u = E\,u^*u > 0 \quad\text{and}\quad R_{du} = E\,du^*$$

The resulting minimum mean-square error was further seen to be given by

$$\text{m.m.s.e.} = \sigma_d^2 - R_{ud}R_u^{-1}R_{du} = \sigma_d^2 - \sigma_{\hat d}^2$$

where $\sigma_d^2 = E\,|d|^2$ and $\sigma_{\hat d}^2 = E\,|\hat{d}|^2$.

We proceeded in Chapter 8 to devise steepest-descent schemes for evaluating $w^o$
iteratively, and in Chapter 10 we showed how stochastic gradient algorithms can be used to
approximate $w^o$, also iteratively. These latter algorithms were aimed at situations where
access to the moments $\{R_{du}, R_u\}$ was not readily available, but only access to realizations
$\{d(i), u_i\}$ of the random variables $\{d, u\}$. One such stochastic gradient method was the
LMS algorithm (cf. Sec. 10.2):

$$w_i = w_{i-1} + \mu u_i^*\,[d(i) - u_iw_{i-1}], \quad w_{-1} = 0$$

Another stochastic gradient method was the exponentially-weighted RLS algorithm
described in Sec. 14.1:

$$\begin{aligned}
P_i &= \lambda^{-1}\left[P_{i-1} - \frac{\lambda^{-1}P_{i-1}u_i^*u_iP_{i-1}}{1 + \lambda^{-1}u_iP_{i-1}u_i^*}\right] \\
w_i &= w_{i-1} + P_iu_i^*\,[d(i) - u_iw_{i-1}], \quad w_{-1} = 0
\end{aligned}$$

for some $0 \ll \lambda \le 1$. Both LMS and RLS were motivated and derived in Chapter 10 by
appealing to instantaneous-gradient approximations.
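To make the two recursions concrete, the following is a minimal sketch in plain Python for real-valued data, so that $u_i^*$ reduces to a transpose. The initialization $P_{-1} = \delta I$, the value of `delta`, the step-size, and the noiseless test data are assumptions introduced for the illustration; they are not specified in the passage above.

```python
import random

def lms(regs, d, mu):
    """LMS: w_i = w_{i-1} + mu * u_i^T [d(i) - u_i w_{i-1}], real data."""
    M = len(regs[0])
    w = [0.0] * M
    for u, di in zip(regs, d):
        e = di - sum(uk * wk for uk, wk in zip(u, w))
        w = [wk + mu * uk * e for wk, uk in zip(w, u)]
    return w

def rls(regs, d, lam=1.0, delta=1e3):
    """Exponentially weighted RLS with the assumed initialization P_{-1} = delta*I."""
    M = len(regs[0])
    w = [0.0] * M
    P = [[delta if r == c else 0.0 for c in range(M)] for r in range(M)]
    for u, di in zip(regs, d):
        Pu = [sum(P[r][c] * u[c] for c in range(M)) for r in range(M)]   # P_{i-1} u_i^T
        gamma = 1.0 + sum(u[r] * Pu[r] for r in range(M)) / lam          # 1 + u P u^T / lam
        # P_i = (1/lam) [P_{i-1} - (1/lam) P u^T u P / gamma]; P is symmetric
        P = [[(P[r][c] - Pu[r] * Pu[c] / (lam * gamma)) / lam
              for c in range(M)] for r in range(M)]
        e = di - sum(uk * wk for uk, wk in zip(u, w))                    # uses w_{i-1}
        Piu = [sum(P[r][c] * u[c] for c in range(M)) for r in range(M)]  # P_i u_i^T
        w = [wk + gk * e for wk, gk in zip(w, Piu)]
    return w

random.seed(1)
w_o = [1.0, -2.0, 0.5]                              # hypothetical true weights
regs = [[random.gauss(0.0, 1.0) for _ in w_o] for _ in range(500)]
d = [sum(uk * wk for uk, wk in zip(u, w_o)) for u in regs]

err_lms = max(abs(a - b) for a, b in zip(lms(regs, d, mu=0.05), w_o))
err_rls = max(abs(a - b) for a, b in zip(rls(regs, d), w_o))
print(err_lms, err_rls)                             # both small in this noiseless setup
```

With $\lambda = 1$ and a large `delta`, the RLS estimate is close to the least-squares solution over all the data seen so far, which is the connection the following chapters develop.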
The purpose of Chapters 29-43 is to study the recursive least-squares algorithm in
greater detail. Rather than motivate it as a stochastic gradient approximation to a
steepest-descent method, as was done in Sec. 14.1, the discussion in these chapters will bring forth
deeper insights into the nature of the RLS algorithm. In particular, it will be seen in
Chapter 30 that RLS is an optimal (as opposed to approximate) solution to a well-defined
optimization problem. In addition, the discussion will reveal that RLS is very rich in structure,
so much so that many equivalent variants exist. While all these variants are mathematically
equivalent, they vary among themselves in computational complexity, performance under
finite-precision conditions, and even in modularity and ease of implementation.

In this chapter, we start by studying the famed least-squares criterion in its own right. 
The results obtained here will then be used in Chapter 30 to establish that RLS is in effect 
a recursive solution to this criterion. In our presentation, we have chosen a treatment that 
is rich in both geometric and linear algebraic arguments. In so doing, it is hoped that the 
reader will have an opportunity to appreciate the elegance and beauty of the least-squares 
theory. 


29.1 LEAST-SQUARES PROBLEM 


Assume we have available $N$ realizations of the random variables $d$ and $u$, say,

$$\{d(0), d(1), \ldots, d(N-1)\} \quad\text{and}\quad \{u_0, u_1, \ldots, u_{N-1}\}$$

respectively, where the $\{d(i)\}$ are scalars and the $\{u_i\}$ are $1 \times M$. Given the $\{d(i), u_i\}$,
and assuming ergodicity, we can approximate the mean-square-error cost in (29.1) by its
sample average as

$$E\,|d - uw|^2 \approx \frac{1}{N}\sum_{i=0}^{N-1} |d(i) - u_iw|^2 \tag{29.3}$$

In this way, the optimization problem (29.1) can be replaced by the related problem:

$$\min_w \; \sum_{i=0}^{N-1} |d(i) - u_iw|^2 \tag{29.4}$$

where we have removed the scaling factor $1/N$.
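
Vector Formulation

The cost function (29.4) can be reformulated in vector notation as follows. We collect the
observations $\{d(i)\}$ into an $N \times 1$ vector $y$ and the row vectors $\{u_i\}$ into an $N \times M$ data
matrix $H$:

$$y = \begin{bmatrix} d(0) \\ d(1) \\ d(2) \\ \vdots \\ d(N-1) \end{bmatrix}, \qquad H = \begin{bmatrix} u_0 \\ u_1 \\ u_2 \\ \vdots \\ u_{N-1} \end{bmatrix}$$

Then (29.4) can be rewritten as

$$\min_w \; \|y - Hw\|^2 \tag{29.5}$$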


where the notation $\|\cdot\|^2$ denotes the squared Euclidean norm of its argument, namely,
$\|a\|^2 = a^*a$ for any column vector $a$. Problem (29.5) is known as the standard
least-squares problem.

Of course, we could have posed problem (29.5) directly, without going through (29.1)
and its approximation by (29.3). Here we have opted to motivate the least-squares
criterion by relating it to the mean-square-error criterion (29.1), which we studied extensively
in Chapters 3-10.

10 As explained before in Chapter 19, an ergodic random process is one for which time-averages coincide with
ensemble averages.


Definition 29.1 (Least-squares problem) Given an $N \times 1$ vector $y$ and an
$N \times M$ data matrix $H$, the least-squares problem seeks an $M \times 1$ vector $w$
that solves $\min_w \|y - Hw\|^2$.


Two cases can occur depending on the relation between the dimensions {N, M }: 


1. Over-determined least-squares ($N \ge M$): In this case, the data matrix $H$
has at least as many rows as columns, so that the number of measurements (i.e.,
the number of entries in $y$) is at least equal to the number of unknowns (i.e., the
number of entries in $w$). This situation corresponds to an over-determined
least-squares problem and, as we shall see, (29.5) will either have a unique solution or an
infinite number of solutions.


2. Under-determined least-squares ($N < M$): In this case, the data matrix $H$ has
fewer rows than columns, so that the number of measurements is less than the
number of unknowns. This situation corresponds to an under-determined least-squares
problem for which (29.5) will have an infinite number of solutions.


The purpose of the discussion that follows is to show that all solutions $\hat{w}$ to the
least-squares problem (29.5) are characterized as solutions to the linear system of equations

$$H^*H\hat{w} = H^*y$$

which are known as the normal equations. In addition, the discussion will clarify under
what conditions a unique $\hat{w}$ exists, as opposed to infinitely many, and it will highlight an
important orthogonality property of least-squares solutions. In our presentation, we shall
use both geometric and algebraic derivations to establish these facts. We start with the
geometric argument and later show how to arrive at the same conclusions by means of
algebraic arguments.


29.2 GEOMETRIC ARGUMENT 


Our objective is to characterize all solutions of (29.5). We thus note first that, for any $w$,
the vector $Hw$ lies in the column span (or range space) of the data matrix $H$, written as
$Hw \in \mathcal{R}(H)$. Therefore, the least-squares criterion (29.5) seeks a column
vector in the range space of $H$ that is closest to $y$ in the Euclidean norm sense. Specifically,
the least-squares problem seeks a $\hat{w}$ such that $H\hat{w}$ is closest to $y$.

Now we know from Euclidean geometry that the closest vector to $y$ within $\mathcal{R}(H)$ is
such that the residual vector, $y - H\hat{w}$, is orthogonal to all vectors in $\mathcal{R}(H)$, as illustrated
in Fig. 29.1. For readers not familiar with the geometry of vectors in Euclidean space,
the algebraic derivations in Sec. 29.3 will arrive at the same conclusion and will therefore
provide a justification for this claim.

FIGURE 29.1 A least-squares solution is obtained when $y - H\hat{w}$ is orthogonal to $\mathcal{R}(H)$.

Therefore, it must hold that any candidate solution $\hat{w}$ should result in a residual vector,
$y - H\hat{w}$, that is orthogonal to $Hp$, for any vector $p$ or, equivalently, $p^*H^*(y - H\hat{w}) = 0$.
Clearly, the only vector that is orthogonal to any vector $p$ is the zero vector, so that we must
have

$$H^*(y - H\hat{w}) = 0 \tag{29.6}$$


and we conclude that any solution $\hat{w}$ of the least-squares problem (29.5) must satisfy the
so-called normal equations:

$$H^*H\hat{w} = H^*y \tag{29.7}$$

These equations are always consistent, i.e., a solution $\hat{w}$ always exists. This is because, as
was shown earlier in Sec. B.2, the matrices $H^*H$ and $H^*$ have the same column span, i.e.,
$\mathcal{R}(H^*) = \mathcal{R}(H^*H)$, so that the right-hand side $H^*y$ always lies in $\mathcal{R}(H^*H)$.

For any solution $\hat{w}$ of (29.7), we denote the resulting closest vector to $y$ by $\hat{y} = H\hat{w}$
and we refer to it as the projection of $y$ onto $\mathcal{R}(H)$:

$$\hat{y} = H\hat{w} \;\triangleq\; \text{projection of } y \text{ onto } \mathcal{R}(H) \tag{29.8}$$

We show in Thm. 29.2 further ahead that even when the normal equations (29.7) have a
multitude of solutions $\hat{w}$, all of them will lead to the same value for $\hat{y} = H\hat{w}$. After all,
from a geometric point of view, projecting $y$ onto $\mathcal{R}(H)$ results in a unique projection
$\hat{y}$. What the many $\hat{w}$'s amount to, when they exist, are equivalent representations for this
unique $\hat{y}$ in terms of the columns of $H$.
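A small rank-deficient example illustrates this point. In the sketch below, written in plain Python with exact rational arithmetic, the data matrix `H` and the vector `y` are hypothetical choices. The second column of `H` is twice the first, so the normal equations have infinitely many solutions, and two of them are checked here.

```python
from fractions import Fraction as Fr

H = [[Fr(1), Fr(2)],
     [Fr(2), Fr(4)],
     [Fr(3), Fr(6)]]            # second column = 2 * first column (rank 1)
y = [Fr(1), Fr(2), Fr(4)]

def project(w):
    """Return H w."""
    return [sum(row[k] * w[k] for k in range(2)) for row in H]

def normal_eq_residual(w):
    """Return H^T H w - H^T y (the zero vector iff w solves the normal equations)."""
    Hw = project(w)
    return [sum(H[i][k] * (Hw[i] - y[i]) for i in range(3)) for k in range(2)]

w_a = [Fr(17, 14), Fr(0)]       # one solution of the normal equations
w_b = [Fr(0), Fr(17, 28)]       # a different solution

print(normal_eq_residual(w_a), normal_eq_residual(w_b))   # both are zero vectors
print(project(w_a) == project(w_b))                        # True: same projection
```

Both weight vectors reproduce the same $\hat{y}$, namely $\frac{17}{14}$ times the first column of `H`. They differ only in how that vector is expressed in terms of the two (dependent) columns.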

We shall denote the corresponding residual vector by

$$\tilde{y} = y - H\hat{w}$$

so that the orthogonality condition (29.6) can be rewritten as

$$H^*\tilde{y} = 0 \quad \text{(orthogonality condition)} \tag{29.9}$$

We shall often express this orthogonality condition more succinctly as $\tilde{y} \perp \mathcal{R}(H)$, where
the $\perp$ notation is used to mean that $\tilde{y}$ is orthogonal to any vector in the range space (column
span) of $H$. In particular, since, by construction, $\hat{y} \in \mathcal{R}(H)$, it also holds that

$$\tilde{y} \perp \hat{y} \iff \hat{y}^*\tilde{y} = 0$$
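Let $\xi$ denote the minimum cost of (29.5). It can be evaluated as follows:

$$\begin{aligned}
\xi &= \|y - H\hat{w}\|^2 \\
&= (y - H\hat{w})^*(y - H\hat{w}) \\
&= y^*(y - H\hat{w}), \quad \text{since } \hat{w}^*H^*\tilde{y} = 0 \text{ by (29.9)} \\
&= y^*y - y^*H\hat{w} \\
&= y^*y - \hat{w}^*H^*H\hat{w}, \quad \text{since } y^*H = \hat{w}^*H^*H \text{ by (29.7)} \\
&= y^*y - \hat{y}^*\hat{y}
\end{aligned}$$

That is, we obtain the following equivalent representations for the minimum cost:

$$\xi = \|y\|^2 - \|\hat{y}\|^2 = y^*\tilde{y} \tag{29.10}$$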

In summary, we arrive at the following statement.

Theorem 29.1 (The normal equations) A vector $\hat{w}$ solves the least-squares
problem (29.5) if, and only if, it satisfies the normal equations

$$H^*H\hat{w} = H^*y$$

or, equivalently, if and only if, it satisfies the orthogonality condition

$$y - H\hat{w} \perp \mathcal{R}(H)$$

The normal equations are always consistent, i.e., a solution $\hat{w}$ always exists
and the resulting minimum cost is given by either expression:

$$\xi = \|y\|^2 - \|\hat{y}\|^2 = y^*\tilde{y}$$

where $\hat{y} = H\hat{w}$ is the projection of $y$ onto $\mathcal{R}(H)$ and $\tilde{y} = y - \hat{y}$ is the residual
vector.
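The statements of Thm. 29.1 can be verified on a small full-rank example. The sketch below is plain Python with exact rational arithmetic. The data (a straight-line fit to four points) are hypothetical, and the $2 \times 2$ normal equations are solved by Cramer's rule purely to keep the example self-contained.

```python
from fractions import Fraction as Fr

# Hypothetical data: fit d(i) ~ w0 + w1 * i from N = 4 points (M = 2 unknowns)
H = [[Fr(1), Fr(0)], [Fr(1), Fr(1)], [Fr(1), Fr(2)], [Fr(1), Fr(3)]]
y = [Fr(1), Fr(3), Fr(2), Fr(5)]
N, M = len(H), len(H[0])

# Form the normal equations (H^T H) w = H^T y
G = [[sum(H[i][r] * H[i][c] for i in range(N)) for c in range(M)]
     for r in range(M)]
b = [sum(H[i][r] * y[i] for i in range(N)) for r in range(M)]

# Solve the 2 x 2 system by Cramer's rule (valid here: H has full column rank)
det = G[0][0] * G[1][1] - G[0][1] * G[1][0]
w_hat = [(b[0] * G[1][1] - G[0][1] * b[1]) / det,
         (G[0][0] * b[1] - G[1][0] * b[0]) / det]

y_hat = [sum(H[i][k] * w_hat[k] for k in range(M)) for i in range(N)]  # projection
y_til = [y[i] - y_hat[i] for i in range(N)]                            # residual

orth = [sum(H[i][k] * y_til[i] for i in range(N)) for k in range(M)]   # H^T y_til
cost = sum(r * r for r in y_til)                                       # ||y - H w_hat||^2
alt1 = sum(v * v for v in y) - sum(v * v for v in y_hat)               # ||y||^2 - ||y_hat||^2
alt2 = sum(y[i] * y_til[i] for i in range(N))                          # y^T y_til
print(w_hat, orth, cost, alt1, alt2)
```

For these data the solution is $\hat{w} = \mathrm{col}\{11/10, 11/10\}$, the residual is orthogonal to both columns of `H`, and all three expressions for the minimum cost evaluate to $27/10$.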


29.3 ALGEBRAIC ARGUMENTS 


Before summarizing the above discussion, and before enumerating additional properties 
of the least-squares solution, we proceed to re-derive the above results by means of two 
independent algebraic arguments. 


Differentiation Argument 
Let $J(w)$ denote the cost function in (29.5), i.e.,

$$J(w) \;\triangleq\; \|y - Hw\|^2 = \|y\|^2 - y^*Hw - w^*H^*y + w^*H^*Hw \tag{29.11}$$

Differentiating $J(w)$ with respect to $w$ we find that its gradient vector evaluates to zero at
all $\hat{w}$ that satisfy

$$-y^*H + \hat{w}^*H^*H = 0$$

which are again the normal equations (29.7). The solution(s) $\hat{w}$ so obtained correspond to
minima of $J(w)$ since its Hessian matrix is nonnegative-definite, i.e.,

$$\nabla_w^2\,[J(w)] = H^*H \ge 0$$


Completion-of-Squares Argument

(This section can be skipped on a first reading.)

Starting from the least-squares criterion (29.5), we can alternatively resort to a
completion-of-squares argument in order to characterize its solution(s). We used this technique on a
handful of occasions before (e.g., in Sec. 3.3).

Thus, note that $J(w)$ in (29.11) is quadratic in $w$ and it can be written as

$$J(w) = \begin{bmatrix} y^* & w^* \end{bmatrix}\begin{bmatrix} I_N & -H \\ -H^* & H^*H \end{bmatrix}\begin{bmatrix} y \\ w \end{bmatrix} \tag{29.12}$$

Now the central matrix in (29.12) can be factored into a product of block
upper-diagonal-lower triangular matrices as follows (recall (3.12)):

$$\begin{bmatrix} I_N & -H \\ -H^* & H^*H \end{bmatrix} = \begin{bmatrix} I_N & D \\ 0 & I_M \end{bmatrix}\begin{bmatrix} I_N + DH^* & 0 \\ 0 & H^*H \end{bmatrix}\begin{bmatrix} I_N & 0 \\ D^* & I_M \end{bmatrix} \tag{29.13}$$

where $D$ is chosen as any $N \times M$ matrix satisfying the linear equations

$$D(H^*H) = -H \tag{29.14}$$

For compactness of notation, we are not indicating explicitly the dimensions of the block
zero entries in (29.13); these dimensions can be inferred from the dimensions of $\{D, H\}$
and the identity matrices. The equations (29.14) are consistent since the matrices $H^*H$ and
$H^*$ have the same column spans and, therefore, by writing (29.14) as $H^*HD^* = -H^*$,
we conclude that a solution $D$ always exists.

The triangular factorization (29.13), in this general form, i.e., in terms of a matrix $D$
that solves (29.14), allows us to accommodate the case of singular $H^*H$. Of course, if
$H^*H$ were invertible, then $D = -H(H^*H)^{-1}$, and we could replace the factorization
(29.13) by the more explicit (and familiar) form:

$$\begin{bmatrix} I_N & -H(H^*H)^{-1} \\ 0 & I_M \end{bmatrix}\begin{bmatrix} I_N - H(H^*H)^{-1}H^* & 0 \\ 0 & H^*H \end{bmatrix}\begin{bmatrix} I_N & 0 \\ -(H^*H)^{-1}H^* & I_M \end{bmatrix}$$

However, we shall continue to use (29.13) in order to accommodate the general case of
possibly singular $H^*H$. Substituting (29.13) into (29.12) we find that

$$J(w) = y^*(I + DH^*)y + (w + D^*y)^*H^*H(w + D^*y) \tag{29.15}$$

where only the second term depends on the unknown $w$. This second term is always
nonnegative since

$$(w + D^*y)^*H^*H(w + D^*y) = \|H(w + D^*y)\|^2 \ge 0 \quad \text{for any } w$$
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Therefore, 
J(w) > y* (I+ DH*)y, for any w 
and the minimum of J(w) is attained when w is chosen to annihilate the second term of 


(29.15), i.e., when it is chosen as any à? satisfying H (© + D*y) = 0. In this way, we arrive 
at the conclusion that all solutions to (29.5) are vectors © that satisfy 


Hi —-HD*"y, where D is any solution to DH*H = —H (29.16) 


This description is equivalent to the normal equations (29.7). To see this, we multiply
(29.16) by $H^*$ from the left, and use the definition of $D$ from (29.14), to conclude that $\widehat{w}$ satisfies
the normal equations


$$ H^*H\widehat{w} = H^*y $$   (29.17)


Conversely, if $\widehat{w}$ satisfies (29.17), then $H^*H\widehat{w} = -H^*HD^*y$ or, equivalently,

$$ H^*H(\widehat{w} + D^*y) = 0 $$


This means that

$$ \widehat{w} + D^*y \in \mathcal{N}(H^*H) $$

But since we already know from Sec. B.2 that the matrices $H^*H$ and $H$ have the same
nullspace, we conclude that

$$ \widehat{w} + D^*y \in \mathcal{N}(H) $$

That is, $H(\widehat{w} + D^*y) = 0$, which agrees with (29.16).


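Note that the equations (29.17) re-confirm our earlier remark in (29.6) that the residual
vector, $\widetilde{y} = y - H\widehat{w}$, satisfies the orthogonality condition


$$ H^*(y - H\widehat{w}) = 0 $$   (29.18)

Moreover, the resulting minimum value of the cost function will be


$$ \begin{aligned}
J(\widehat{w}) &= y^*(I + DH^*)y \\
&= y^*y - \widehat{y}^*y, \quad \text{since } \widehat{y} = H\widehat{w} = -HD^*y \text{ by (29.16)} \\
&= y^*y - \widehat{y}^*[\widehat{y} + \widetilde{y}] \\
&= y^*y - \widehat{y}^*\widehat{y}, \quad \text{since } \widehat{y} \in \mathcal{R}(H) \text{ and } \widehat{y}^*\widetilde{y} = 0 \text{ by (29.18)} \\
&= \|y\|^2 - \|\widehat{y}\|^2
\end{aligned} $$


as in (29.10).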
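
The completion-of-squares argument can be checked numerically. The following is a minimal sketch, assuming NumPy, real-valued synthetic data (so that $*$ reduces to a transpose), and a deliberately rank-deficient $H$; the sizes, the seed, and the choice of $D$ through a pseudo-inverse are illustrative assumptions and are not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 6, 4
H = rng.standard_normal((N, M))
H[:, 3] = H[:, 0]                  # duplicate column, so H^*H is singular
y = rng.standard_normal(N)

G = H.T @ H                                  # H^*H for real data
D = -H @ np.linalg.pinv(G, rcond=1e-10)      # one particular solution of D (H^*H) = -H

w_hat = -D.T @ y                             # satisfies H w_hat = -H D^* y, cf. (29.16)
J_min = y @ (np.eye(N) + D @ H.T) @ y        # y^*(I + D H^*) y, first term of (29.15)


def J(w):
    """Least-squares cost ||y - H w||^2."""
    return float(np.sum((y - H @ w) ** 2))


print("D (H^*H) = -H holds:", np.allclose(D @ G, -H))
print("normal equations hold:", np.allclose(G @ w_hat, H.T @ y))
print("J(w_hat) - J_min:", J(w_hat) - J_min)
```

Any other solution $D$ of (29.14) would serve equally well; the pseudo-inverse is used here only because it is a convenient way to pick one when $H^*H$ is singular.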


29.4 PROPERTIES OF LEAST-SQUARES SOLUTION 


Actually, we can say more about the solution of the least-squares problem. The following
statement enumerates several additional properties. In particular, it specifies when a unique
solution $\widehat{w}$ exists and, perhaps more interestingly, it states that even when many solutions
$\widehat{w}$ exist, the projection $\widehat{y} = H\widehat{w}$ is always unique!


Theorem 29.2 (Properties of least-squares solutions) The least-squares problem (29.5) has
the following properties:

1. The solution $\widehat{w}$ is unique if, and only if, the data matrix $H$ has full
column rank (i.e., all its columns are linearly independent, which necessarily
requires $N \geq M$). In this case, $\widehat{w}$ is given by

$$ \widehat{w} = (H^*H)^{-1}H^*y $$

This situation occurs only for over-determined least-squares problems.


2. When $H^*H$ is singular, then infinitely many solutions $\widehat{w}$ exist and any
two solutions differ by a vector in the nullspace of $H$, i.e., if $\widehat{w}_1$ and
$\widehat{w}_2$ are any two solutions, then $\widehat{w}_1 - \widehat{w}_2 \in \mathcal{N}(H)$. This situation can
occur for both over- and under-determined least-squares problems.


3. When many solutions $\widehat{w}$ exist, regardless of which one we pick, the
resulting projection vector $\widehat{y} = H\widehat{w}$ is the same and the resulting minimum
cost is also the same and given by $\xi = \|y\|^2 - \|\widehat{y}\|^2$.


4. When many solutions $\widehat{w}$ exist, the one that has the smallest Euclidean
norm, namely, the one that solves

$$ \min_w \|w\|^2 \quad \text{subject to} \quad H^*Hw = H^*y $$

is given by $\widehat{w} = H^{\dagger}y$, where $H^{\dagger}$ denotes the pseudo-inverse of $H$.


Proof: We proceed in steps.


1. The normal equations $H^*H\widehat{w} = H^*y$ have a unique solution if, and only if, $H^*H$ is
invertible. This condition cannot happen when $N < M$ since the product $H^*H$ will be rank
deficient. Therefore, we must have $N \geq M$. Moreover, recall that we proved in Sec. B.2 that
for any matrix $H$ having at least as many rows as columns, it holds that $H^*H$ is invertible if,
and only if, $H$ has full rank. These facts establish the first statement in the theorem.


2. Let $r$ be any nonzero vector in the nullspace of $H^*H$, i.e., $H^*Hr = 0$. If $\widehat{w}$ solves the normal
equations, i.e., if $H^*H\widehat{w} = H^*y$, then so does $\widehat{w} + r$ since $H^*H(\widehat{w} + r) = H^*y$. Therefore,
infinitely many solutions to the normal equations exist in this case. Moreover, if $\widehat{w}_1$ and $\widehat{w}_2$
are any two solutions, say, $H^*H\widehat{w}_1 = H^*y$ and $H^*H\widehat{w}_2 = H^*y$, then $H^*H(\widehat{w}_1 - \widehat{w}_2) = 0$.
That is, $\widehat{w}_1 - \widehat{w}_2 \in \mathcal{N}(H^*H)$. However, we proved in Sec. B.2 that, for any matrix $H$, the
matrices $H^*H$ and $H$ have the same nullspace and, hence, $H(\widehat{w}_1 - \widehat{w}_2) = 0$. These facts
establish the second statement in the theorem.


3. Let $\widehat{w}_1$ and $\widehat{w}_2$ denote any two solutions when multiple solutions exist and let $\widehat{y}_1 = H\widehat{w}_1$ and
$\widehat{y}_2 = H\widehat{w}_2$ denote the corresponding projections. Then $\widehat{y}_1 - \widehat{y}_2 = H(\widehat{w}_1 - \widehat{w}_2) = 0$ since,
by the second property, $\widehat{w}_1 - \widehat{w}_2 \in \mathcal{N}(H)$. Therefore, $\widehat{y}_1 = \widehat{y}_2$, which establishes the third
statement in the theorem.


4. We first remark that, for a general matrix $H$, the pseudo-inverse is defined in Sec. B.6, where
the fourth statement in the theorem is also proven (see Lemma B.7). Here we note that when
$H$ has full rank, its pseudo-inverse is given by the following expressions:


$$ H^{\dagger} = \begin{cases} (H^*H)^{-1}H^* & \text{when } N \geq M \text{ (a "tall" matrix)} \\ H^*(HH^*)^{-1} & \text{when } N \leq M \text{ (a "fat" matrix)} \\ H^{-1} & \text{when } N = M \text{ (a square matrix)} \end{cases} $$

When $H$ is rank-deficient, it is more convenient to define its pseudo-inverse in terms of its
singular value decomposition, as explained in Sec. B.6. [See also Prob. VII.6 for a proof, from
first principles, of the fourth statement of the theorem in the under-determined case.]


◊
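
The properties in the theorem lend themselves to a quick numerical illustration. The sketch below assumes NumPy and real-valued synthetic data; the rank-deficient matrix, the nullspace vector `r`, and the scaling `2.5` are hypothetical choices made only for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 8, 5
H = rng.standard_normal((N, M))
H[:, 4] = H[:, 0] + H[:, 1]          # force rank deficiency (rank 4 < M)
y = rng.standard_normal(N)

# Minimum-norm solution w = H^+ y (fourth property).
w_min = np.linalg.pinv(H, rcond=1e-10) @ y

# A second solution obtained by adding a nullspace vector of H (second property).
r = np.array([1.0, 1.0, 0.0, 0.0, -1.0])   # H r = 0 by construction of the last column
w_alt = w_min + 2.5 * r

# Both vectors satisfy the normal equations and give the same projection (third property).
print("normal equations (w_alt):", np.allclose(H.T @ H @ w_alt, H.T @ y))
print("same projection:", np.allclose(H @ w_min, H @ w_alt))
print("norms:", np.linalg.norm(w_min), np.linalg.norm(w_alt))
```

The two solutions differ, yet the projection $H\widehat{w}$ and hence the minimum cost coincide, and the pseudo-inverse solution has the smaller Euclidean norm.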


29.5 PROJECTION MATRICES 


We restrict ourselves in this section to the case of over-determined least-squares problems
with a full-rank data matrix $H$ (and, hence, $N \geq M$). In this case, the coefficient matrix
$H^*H$ is invertible (actually positive-definite) and the least-squares problem (29.5) will
have a unique solution that is given by


$$ \widehat{w} = (H^*H)^{-1}H^*y $$

with the corresponding projection vector

$$ \widehat{y} = H\widehat{w} = H(H^*H)^{-1}H^*y $$


The matrix multiplying $y$ in the above expression is called the projection matrix and we
denote it by


$$ \mathcal{P}_H \triangleq H(H^*H)^{-1}H^*, \quad \text{when $H$ has full column rank} $$   (29.19)


The designation projection matrix stems from the fact that multiplying $y$ by $\mathcal{P}_H$ amounts
to projecting it onto the column span of $H$. Such projection matrices play a prominent
role in least-squares theory and they have many useful properties. For example, projection
matrices are Hermitian and also idempotent, i.e., they satisfy


$$ \mathcal{P}_H^* = \mathcal{P}_H, \qquad \mathcal{P}_H^2 = \mathcal{P}_H $$   (29.20)


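Note further that the residual vector,

$$ \widetilde{y} = y - H\widehat{w} $$

is given by


$$ \widetilde{y} = y - \mathcal{P}_Hy = (I - \mathcal{P}_H)y = \mathcal{P}_H^{\perp}y $$


so that the matrix

$$ \mathcal{P}_H^{\perp} \triangleq I - \mathcal{P}_H $$

is called the projection matrix onto the orthogonal complement space of $H$. It is also easy
to see that the minimum cost of the least-squares problem (29.5) can be expressed in terms
of $\mathcal{P}_H^{\perp}$ as follows:


$$ \begin{aligned}
\xi &= y^*y - \widehat{y}^*\widehat{y} \\
&= y^*y - y^*\mathcal{P}_H^*\mathcal{P}_Hy \\
&= y^*y - y^*\mathcal{P}_Hy, \quad \text{since } \mathcal{P}_H^*\mathcal{P}_H = \mathcal{P}_H^2 = \mathcal{P}_H \\
&= y^*\mathcal{P}_H^{\perp}y
\end{aligned} $$
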
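Lemma 29.1 (Unique solution) When the matrix $H$ has full column rank (and,
hence, $N \geq M$), the least-squares problem (29.5) will have a unique solution
that is given by $\widehat{w} = (H^*H)^{-1}H^*y$. Moreover, the projection of $y$ onto $\mathcal{R}(H)$,
and the corresponding residual vector, are given by $\widehat{y} = \mathcal{P}_Hy$ and $\widetilde{y} = \mathcal{P}_H^{\perp}y$
so that $y$ can be decomposed as


$$ y = \widehat{y} + \widetilde{y} = \mathcal{P}_Hy + \mathcal{P}_H^{\perp}y $$


with $\|y\|^2 = \|\widehat{y}\|^2 + \|\widetilde{y}\|^2$. The resulting minimum cost is $\xi = y^*\mathcal{P}_H^{\perp}y$.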
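
The statements of the lemma can be verified on a small example. The following sketch assumes NumPy and complex-valued random data of illustrative size; it forms the projection matrix explicitly, which is convenient for checking the identities but would normally be avoided in large problems.

```python
import numpy as np

rng = np.random.default_rng(2)
N, M = 7, 3
H = rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M))
y = rng.standard_normal(N) + 1j * rng.standard_normal(N)

Hh = H.conj().T                           # H^*
P = H @ np.linalg.solve(Hh @ H, Hh)       # P_H = H (H^*H)^{-1} H^*, cf. (29.19)
P_perp = np.eye(N) - P                    # projector onto the orthogonal complement

y_hat = P @ y                             # projection of y onto R(H)
y_til = P_perp @ y                        # residual vector
xi = np.vdot(y, P_perp @ y).real          # minimum cost y^* P_H^perp y

# Reference solution from a standard least-squares solver.
w_hat, *_ = np.linalg.lstsq(H, y, rcond=None)

print("Hermitian:", np.allclose(P, P.conj().T))
print("idempotent:", np.allclose(P @ P, P))
print("cost:", xi, np.linalg.norm(y - H @ w_hat) ** 2)
```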


29.6 WEIGHTED LEAST-SQUARES 


It is often the case that weighting is incorporated into the cost function of the least-squares
problem, so that (29.5) is replaced by


$$ \min_w\; (y - Hw)^*W(y - Hw), \qquad W > 0 $$   (29.21)


where $W$ is a Hermitian positive-definite matrix. For example, when $W$ is diagonal, its
elements assign different weights to the entries of the error vector $y - Hw$.
We shall often rewrite the cost function in (29.21) more compactly as


$$ \min_w\; \|y - Hw\|_W^2 $$   (29.22)


where, for any column vector $x$, the notation $\|x\|_W^2$ refers to the weighted Euclidean norm
of $x$, i.e., $\|x\|_W^2 = x^*Wx$.

One way to solve (29.22) is to show that it reduces to the standard form (29.5). To see
this, we introduce the eigen-decomposition


$$ W = V\Lambda V^* $$


where $\Lambda$ is diagonal with positive entries and $V$ is unitary, i.e., it satisfies $VV^* = V^*V = I$.
Let $\Lambda^{1/2}$ denote the diagonal matrix whose entries are equal to the positive square-roots
of the entries of $\Lambda$, and define the change of variables:


$$ a \triangleq \Lambda^{1/2}V^*y, \qquad A \triangleq \Lambda^{1/2}V^*H $$   (29.23)


Observe that since both $\Lambda^{1/2}$ and $V$ are invertible, it follows that $A$ has full column rank
if, and only if, $H$ has full column rank.
Using the variables $\{a, A\}$ so defined, we can rewrite the weighted problem (29.21) in
the equivalent form

$$ \min_w\; \|a - Aw\|^2 $$   (29.24)


which is of the same form as the standard (unweighted) least-squares problem (29.5).
Therefore, we can extend all the results we obtained for (29.5) to the weighted version
(29.21) by working with (29.24) instead. In particular, we readily conclude from (29.24)
that any solution $\widehat{w}$ to (29.21) should satisfy the orthogonality condition $A^*(a - A\widehat{w}) = 0$
(cf. (29.6)), which, upon using the definitions (29.23) for $\{a, A\}$, can be rewritten in terms
of the original data $\{y, H\}$ as


$$ H^*V\Lambda^{1/2}\left(\Lambda^{1/2}V^*y - \Lambda^{1/2}V^*H\widehat{w}\right) = 0 $$


or, equivalently,

$$ H^*W(y - H\widehat{w}) = 0 $$   (29.25)


Comparing with the orthogonality condition (29.6) in the unweighted case, we see that
the only difference is the presence of the weighting matrix $W$. This conclusion suggests
that we can extend to the weighted least-squares setting the same geometric properties
of the standard least-squares setting if we simply employ the concept of weighted inner
products. Specifically, for any two column vectors $\{c, d\}$, we can define their weighted
inner product as $\langle c, d\rangle_W = c^*Wd$, and then say that $c$ and $d$ are orthogonal whenever their
weighted inner product is zero. Using this definition, we can interpret (29.25) to mean that
the residual vector, $y - H\widehat{w}$, is orthogonal to the column span of $H$ in a weighted sense,
i.e.,

$$ \langle q, y - H\widehat{w}\rangle_W = 0 \quad \text{for any } q \in \mathcal{R}(H) $$


We further conclude from (29.25) that the normal equations (29.7) are now replaced by


$$ H^*WH\widehat{w} = H^*Wy $$   (29.26)


Proceeding in this manner, and applying the results of the previous sections (especially
Thms. 29.1 and 29.2) to (29.24), we arrive at the following statement (see Prob. VII.9).


Theorem 29.3 (Weighted least-squares) A vector $\widehat{w}$ is a solution of the
weighted least-squares problem (29.21) if, and only if, it satisfies the normal
equations $H^*WH\widehat{w} = H^*Wy$. Moreover, the following properties hold:


1. These normal equations are always consistent, i.e., a solution $\widehat{w}$ always
exists.


2. The solution $\widehat{w}$ is unique if, and only if, the data matrix $H$ has full
column rank, which necessarily requires $N \geq M$. In this case, $\widehat{w}$ is
given by $\widehat{w} = (H^*WH)^{-1}H^*Wy$.


3. When $H^*WH$ is singular, which is equivalent to $H^*H$ being singular,
then many solutions $\widehat{w}$ exist and any two solutions differ by a vector
in the nullspace of $H$, i.e., if $\widehat{w}_1$ and $\widehat{w}_2$ are any two solutions, then
$H(\widehat{w}_1 - \widehat{w}_2) = 0$.


4. When many solutions $\widehat{w}$ exist, regardless of which one we pick, the
resulting projection vector $\widehat{y} = H\widehat{w}$ is the same and the resulting
minimum cost is also the same and given by either expression:
$\xi = \|y\|_W^2 - \|\widehat{y}\|_W^2 = y^*W\widetilde{y}$, where $\widehat{y} = H\widehat{w}$ and $\widetilde{y} = y - \widehat{y}$.


5. When many solutions $\widehat{w}$ exist, the one that has the smallest Euclidean
norm, namely, the one that solves

$$ \min_w \|w\|^2 \quad \text{subject to} \quad H^*WHw = H^*Wy $$

is given by $\widehat{w} = A^{\dagger}a$, where $A = \Lambda^{1/2}V^*H$ and $a = \Lambda^{1/2}V^*y$.


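Note that in the special case of over-determined least-squares problems with full-rank
data matrices $H$ (and, hence, $N \geq M$), problem (29.21) will have a unique solution that is
given by

$$ \widehat{w} = (H^*WH)^{-1}H^*Wy $$

with the corresponding projection vector

$$ \widehat{y} = H\widehat{w} = H(H^*WH)^{-1}H^*Wy $$


We shall continue to refer to the matrix multiplying $y$ in the above expression as a projection
matrix,


$$ \mathcal{P}_H \triangleq H(H^*WH)^{-1}H^*W, \quad \text{when $H$ has full column rank} $$   (29.27)


and write $\widehat{y} = \mathcal{P}_Hy$. In contrast to the unweighted case (29.20), the matrix $\mathcal{P}_H$ now
satisfies the properties:


$$ \mathcal{P}_H^*W = W\mathcal{P}_H, \qquad \mathcal{P}_H^2 = \mathcal{P}_H, \qquad \mathcal{P}_H^*W\mathcal{P}_H = W\mathcal{P}_H $$

and the minimum cost of (29.21) can be expressed in terms of $\mathcal{P}_H^{\perp}$ as follows:

$$ \xi = y^*Wy - \widehat{y}^*W\widehat{y} = y^*Wy - y^*W\mathcal{P}_Hy = y^*W\mathcal{P}_H^{\perp}y $$

where $\mathcal{P}_H^{\perp} = I - \mathcal{P}_H$.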
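
The reduction of the weighted problem to the standard form can be sketched numerically. The example below assumes NumPy and real-valued synthetic data; the way the positive-definite weight `W` is generated and all sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
N, M = 9, 4
H = rng.standard_normal((N, M))
y = rng.standard_normal(N)
B = rng.standard_normal((N, N))
W = B @ B.T + N * np.eye(N)              # symmetric positive-definite weighting matrix

# Change of variables (29.23): a = Lambda^{1/2} V^* y and A = Lambda^{1/2} V^* H.
lam, V = np.linalg.eigh(W)               # W = V diag(lam) V^T
S = np.diag(np.sqrt(lam)) @ V.T
a, A = S @ y, S @ H

# Standard least-squares on {a, A} versus the weighted normal equations (29.26).
w_white, *_ = np.linalg.lstsq(A, a, rcond=None)
w_normal = np.linalg.solve(H.T @ W @ H, H.T @ W @ y)

# Weighted projection matrix (29.27) and the two expressions for the minimum cost.
P = H @ np.linalg.solve(H.T @ W @ H, H.T @ W)
resid = y - H @ w_normal
xi_direct = resid @ W @ resid
xi_proj = y @ W @ (np.eye(N) - P) @ y

print("same solution:", np.allclose(w_white, w_normal))
print("minimum cost:", xi_direct, xi_proj)
```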


29.7 REGULARIZED LEAST-SQUARES 


A second variation of the standard least-squares problem (29.5) is regularized least-squares.
In this formulation, we seek a vector $\widehat{w}$ that solves


$$ \min_w \left[ (w - \bar{w})^*\Pi(w - \bar{w}) + \|y - Hw\|^2 \right] $$   (29.28)


where, compared with (29.5), we are now incorporating the so-called regularization term
$\|w - \bar{w}\|_\Pi^2$. Here, $\Pi$ is a positive-definite matrix, usually a multiple of the identity, and $\bar{w}$
is a given column vector, usually $\bar{w} = 0$.

One motivation for using regularization is that it allows us to incorporate some a priori
information about the solution into the problem statement. Assume, for instance, that we
set $\Pi = \delta I$ and choose $\delta$ as a large positive number. Then, the first term in the cost function
(29.28) becomes dominant and it is not hard to imagine that the cost will be minimized by
a vector $\widehat{w}$ that is close to $\bar{w}$ in order to offset the dominant effect of this first term. For
this reason, we say that a "large" $\Pi$ reflects high confidence that $\bar{w}$ is a good guess for the
solution $\widehat{w}$. On the other hand, a "small" $\Pi$ indicates a high degree of uncertainty in the
initial guess $\bar{w}$.

Another reason for regularization is that it relieves problems associated with rank deficiency
in the data matrix $H$. To clarify this issue, we first need to solve (29.28). The
solution can be obtained in many ways (including, e.g., plain differentiation of the cost
function with respect to $w$ as was done in Sec. 29.3). We choose instead to solve (29.28)
by showing again how it can be reduced to the solution of a standard least-squares problem
of the form (29.5). This line of argument helps clarify the role of the orthogonality
condition in regularized least-squares.
Thus, introduce the change of variables $z = w - \bar{w}$ and $b = y - H\bar{w}$, so that the
regularized cost function in (29.28) becomes


$$ \min_z \left[ z^*\Pi z + \|b - Hz\|^2 \right] $$   (29.29)


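Introduce further the eigen-decomposition of $\Pi$, say, $\Pi = U\Lambda U^*$ where $U$ is unitary and
$\Lambda$ has positive diagonal entries. Let $\Lambda^{1/2}$ denote the diagonal matrix whose entries are the
positive square-roots of the entries of $\Lambda$. Then we can rewrite the cost in (29.29) in the
equivalent form


$$ \min_z \left\| \begin{bmatrix} 0 \\ b \end{bmatrix} - \begin{bmatrix} \Lambda^{1/2}U^* \\ H \end{bmatrix} z \right\|^2 $$   (29.30)


This problem is now of the same form as the standard least-squares problem (29.5), where
the roles of the vector $y$ and the data matrix $H$ are played by


$$ \begin{bmatrix} 0 \\ b \end{bmatrix} \quad \text{and} \quad \begin{bmatrix} \Lambda^{1/2}U^* \\ H \end{bmatrix} $$

respectively. The orthogonality condition (29.6) of least-squares solutions then shows that
any solution $\widehat{z}$ must satisfy


$$ \begin{bmatrix} \Lambda^{1/2}U^* \\ H \end{bmatrix}^* \left( \begin{bmatrix} 0 \\ b \end{bmatrix} - \begin{bmatrix} \Lambda^{1/2}U^* \\ H \end{bmatrix} \widehat{z} \right) = 0 $$   (29.31)


which, upon using $\widehat{z} = \widehat{w} - \bar{w}$, leads to the linear system of equations:


$$ [\Pi + H^*H](\widehat{w} - \bar{w}) = H^*(y - H\bar{w}) $$   (29.32)


These equations characterize the solution of the regularized least-squares problem (29.28).
When $\bar{w} = 0$, the equations reduce to


$$ [\Pi + H^*H]\,\widehat{w} = H^*y $$   (29.33)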


Compared with the normal equations (29.7) for the standard least-squares problem (29.5),
we see that the presence of the positive-definite matrix $\Pi$ in (29.33) guarantees an invertible
coefficient matrix since $\Pi + H^*H > 0$, regardless of whether $H$ is rank-deficient or not
and regardless of how the row and column dimensions of $H$ compare to each other. For
this reason, the solution $\widehat{w}$ of the regularized least-squares problem, as given by (29.32),
exists and is always unique.

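Using the expression given in Thm. 29.1 for the minimum cost of a standard least-squares
problem, we can similarly evaluate the minimum cost of the regularized least-squares
problem (29.30) as


$$ \begin{aligned}
\xi &= \begin{bmatrix} 0 \\ b \end{bmatrix}^* \left( \begin{bmatrix} 0 \\ b \end{bmatrix} - \begin{bmatrix} \Lambda^{1/2}U^* \\ H \end{bmatrix} \widehat{z} \right) \\
&= (y - H\bar{w})^*\,\widetilde{y}
\end{aligned} $$   (29.34)


where $\widetilde{y} = y - \widehat{y}$ and $\widehat{y} = H\widehat{w}$.

An equivalent expression for $\xi$ can be obtained as follows:


$$ \begin{aligned}
\xi &= (y - H\bar{w})^*\left[(y - H\bar{w}) - H(\widehat{w} - \bar{w})\right] \\
&= (y - H\bar{w})^*\left[I - H[\Pi + H^*H]^{-1}H^*\right](y - H\bar{w}) \\
&= (y - H\bar{w})^*\left[I + H\Pi^{-1}H^*\right]^{-1}(y - H\bar{w})
\end{aligned} $$   (29.35)


where we used the matrix inversion lemma (5.4) in the last step. Note that expression
(29.34) does not include $\Pi$ explicitly.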
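Finally, observe that the orthogonality condition (29.31) can be written as


$$ \begin{bmatrix} U\Lambda^{1/2} & H^* \end{bmatrix} \begin{bmatrix} -\Lambda^{1/2}U^*\widehat{z} \\ \widetilde{b} \end{bmatrix} = 0 $$


where

$$ \widetilde{b} = b - H\widehat{z} = (y - H\bar{w}) - H(\widehat{w} - \bar{w}) = y - \widehat{y} = \widetilde{y} $$

Recalling that $\Pi = U\Lambda U^*$ and $\widehat{z} = \widehat{w} - \bar{w}$, the above orthogonality condition can be
rewritten more compactly as


$$ H^*\widetilde{y} = \Pi(\widehat{w} - \bar{w}) \quad \text{(orthogonality condition)} $$   (29.36)


In the absence of regularization, i.e., when $\Pi = 0$, the above result reduces to the standard
orthogonality condition (29.9), namely, it becomes $H^*\widetilde{y} = 0$.
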
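Theorem 29.4 (Regularized least-squares) The solution of the regularized
least-squares problem (29.28) is always unique and given by


$$ \widehat{w} = \bar{w} + [\Pi + H^*H]^{-1}H^*(y - H\bar{w}) $$

The resulting minimum cost is given by either expression:


$$ \xi = (y - H\bar{w})^*\,\widetilde{y} = (y - H\bar{w})^*\left[I + H\Pi^{-1}H^*\right]^{-1}(y - H\bar{w}) $$


where $\widetilde{y} = y - \widehat{y}$ and $\widehat{y} = H\widehat{w}$. Moreover, $\widehat{w}$ satisfies the orthogonality
condition $H^*\widetilde{y} = \Pi(\widehat{w} - \bar{w})$.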
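
The theorem can be illustrated with a small under-determined example, where the unregularized normal equations would be singular. The sketch assumes NumPy, real-valued synthetic data, and the particular choice $\Pi = \delta I$ with $\delta = 0.1$; these values are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(4)
N, M = 3, 5                               # fewer rows than columns on purpose
H = rng.standard_normal((N, M))
y = rng.standard_normal(N)
w_bar = rng.standard_normal(M)            # initial guess for the solution
delta = 0.1
Pi = delta * np.eye(M)                    # regularization matrix Pi = delta * I

# Closed-form solution of Thm. 29.4.
b = y - H @ w_bar
w_hat = w_bar + np.linalg.solve(Pi + H.T @ H, H.T @ b)

# Same solution through the augmented standard least-squares problem (29.30);
# for Pi = delta * I we have Lambda^{1/2} U^* = sqrt(delta) * I.
A_aug = np.vstack([np.sqrt(delta) * np.eye(M), H])
b_aug = np.concatenate([np.zeros(M), b])
z_hat, *_ = np.linalg.lstsq(A_aug, b_aug, rcond=None)

# Three ways of evaluating the minimum cost.
y_til = y - H @ w_hat
xi_direct = (w_hat - w_bar) @ Pi @ (w_hat - w_bar) + y_til @ y_til
xi_a = b @ y_til                                                        # (29.34)
xi_b = b @ np.linalg.solve(np.eye(N) + H @ np.linalg.solve(Pi, H.T), b)  # (29.35)

print("same solution:", np.allclose(w_bar + z_hat, w_hat))
print("minimum cost:", xi_direct, xi_a, xi_b)
```

Expression (29.35) only requires the inversion of an $N \times N$ matrix, which can be the cheaper route when $N < M$ as in this example.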


29.8 WEIGHTED REGULARIZED LEAST-SQUARES 


We can combine the formulations of Secs. 29.6 and 29.7 and introduce a weighted regularized
least-squares problem. The weighted version of (29.28) would have the form


$$ \min_w \left[ (w - \bar{w})^*\Pi(w - \bar{w}) + (y - Hw)^*W(y - Hw) \right] $$   (29.37)


where, as before, $W$ is positive-definite. Actually, with $\Pi > 0$, the weighting matrix $W$
can be allowed to be nonnegative-definite. It is easy to verify that all the expressions in
Thm. 29.5 further ahead that do not involve an inverse of $W$ will still hold.

Again, the solution of (29.37) can be obtained in many ways (including plain differentiation
with respect to $w$). Here, as before, we choose to solve (29.37) by showing how it
reduces to the standard least-squares problem (29.5). For this purpose, we resort one more
time to the eigen-decomposition $W = V\Lambda V^*$, and define the normalized quantities


$$ a = \Lambda^{1/2}V^*y \quad \text{and} \quad A = \Lambda^{1/2}V^*H $$


Then the weighted regularized problem (29.37) becomes


$$ \min_w \left[ (w - \bar{w})^*\Pi(w - \bar{w}) + \|a - Aw\|^2 \right] $$   (29.38)


which is of the same form as the unweighted regularized least-squares problem (29.28).
We can therefore invoke Thm. 29.4, and the definitions of $\{a, A\}$ above, to arrive at the
following statement, where the orthogonality condition (29.36) is now replaced by


$$ H^*W\widetilde{y} = \Pi(\widehat{w} - \bar{w}) \quad \text{(orthogonality condition)} $$   (29.39)


Theorem 29.5 (Weighted regularized least-squares) The solution of the
weighted regularized least-squares problem (29.37) is always unique and given
by


$$ \widehat{w} = \bar{w} + [\Pi + H^*WH]^{-1}H^*W(y - H\bar{w}) $$

and the resulting minimum cost is given by

$$ \xi = (y - H\bar{w})^*W\widetilde{y} = (y - H\bar{w})^*\left[W^{-1} + H\Pi^{-1}H^*\right]^{-1}(y - H\bar{w}) $$


where $\widetilde{y} = y - \widehat{y}$ and $\widehat{y} = H\widehat{w}$. Moreover, $\widehat{w}$ satisfies the orthogonality
condition $H^*W\widetilde{y} = \Pi(\widehat{w} - \bar{w})$.


For ease of reference, Tables 29.1 through 29.3 at the end of this section summarize several
of the properties of the least-squares problems studied in this chapter.
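
The remark that $W$ may be merely nonnegative-definite when $\Pi > 0$ can be illustrated as follows. The sketch assumes NumPy and real-valued synthetic data; the diagonal weight with two zero entries and the value of $\Pi$ are hypothetical choices. Only expressions that do not involve $W^{-1}$ are evaluated.

```python
import numpy as np

rng = np.random.default_rng(5)
N, M = 6, 3
H = rng.standard_normal((N, M))
y = rng.standard_normal(N)
w_bar = rng.standard_normal(M)
Pi = 2.0 * np.eye(M)
W = np.diag([1.0, 0.5, 0.0, 2.0, 0.0, 1.5])    # nonnegative-definite: two zero weights


def cost(w):
    """Weighted regularized cost (29.37) evaluated at w."""
    e = y - H @ w
    return (w - w_bar) @ Pi @ (w - w_bar) + e @ W @ e


# Solution of Thm. 29.5; the coefficient matrix is invertible because Pi > 0.
b = y - H @ w_bar
w_hat = w_bar + np.linalg.solve(Pi + H.T @ W @ H, H.T @ W @ b)
y_til = y - H @ w_hat

xi_direct = cost(w_hat)
xi_thm = b @ W @ y_til                          # (y - H w_bar)^* W y_tilde

print("orthogonality (29.39):", np.allclose(H.T @ W @ y_til, Pi @ (w_hat - w_bar)))
print("minimum cost:", xi_direct, xi_thm)
```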


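TABLE 29.1 Normal equations associated with several least-squares problems.

Problem | Cost function | Normal equations
Standard least-squares | $\min_w \lVert y - Hw \rVert^2$ | $H^*H\widehat{w} = H^*y$
Weighted least-squares | $\min_w \lVert y - Hw \rVert_W^2$, $W > 0$ | $H^*WH\widehat{w} = H^*Wy$
Regularized least-squares | $\min_w \left[ \lVert w - \bar{w} \rVert_\Pi^2 + \lVert y - Hw \rVert^2 \right]$, $\Pi > 0$ | $(\Pi + H^*H)(\widehat{w} - \bar{w}) = H^*(y - H\bar{w})$
Weighted regularized least-squares | $\min_w \left[ \lVert w - \bar{w} \rVert_\Pi^2 + \lVert y - Hw \rVert_W^2 \right]$, $\Pi > 0$, $W \geq 0$ | $(\Pi + H^*WH)(\widehat{w} - \bar{w}) = H^*W(y - H\bar{w})$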


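TABLE 29.2 Orthogonality conditions associated with several least-squares problems. In the
statements below, $\widetilde{y} = y - \widehat{y}$ where $\widehat{y} = H\widehat{w}$.

Problem | Cost function | Orthogonality condition
Standard least-squares | $\min_w \lVert y - Hw \rVert^2$ | $H^*\widetilde{y} = 0$
Weighted least-squares | $\min_w \lVert y - Hw \rVert_W^2$, $W > 0$ | $H^*W\widetilde{y} = 0$
Regularized least-squares | $\min_w \left[ \lVert w - \bar{w} \rVert_\Pi^2 + \lVert y - Hw \rVert^2 \right]$, $\Pi > 0$ | $H^*\widetilde{y} = \Pi(\widehat{w} - \bar{w})$
Weighted regularized least-squares | $\min_w \left[ \lVert w - \bar{w} \rVert_\Pi^2 + \lVert y - Hw \rVert_W^2 \right]$, $\Pi > 0$, $W \geq 0$ | $H^*W\widetilde{y} = \Pi(\widehat{w} - \bar{w})$


TABLE 29.3 Minimum costs associated with several least-squares problems. In the statements
below, $\widetilde{y} = y - \widehat{y}$ where $\widehat{y} = H\widehat{w}$.

Problem | Cost function | Minimum cost
Standard least-squares | $\min_w \lVert y - Hw \rVert^2$ | $y^*\widetilde{y}$
Weighted least-squares | $\min_w \lVert y - Hw \rVert_W^2$, $W > 0$ | $y^*W\widetilde{y}$
Regularized least-squares | $\min_w \left[ \lVert w - \bar{w} \rVert_\Pi^2 + \lVert y - Hw \rVert^2 \right]$, $\Pi > 0$ | $(y - H\bar{w})^*\widetilde{y}$
Weighted regularized least-squares | $\min_w \left[ \lVert w - \bar{w} \rVert_\Pi^2 + \lVert y - Hw \rVert_W^2 \right]$, $\Pi > 0$, $W \geq 0$ | $(y - H\bar{w})^*W\widetilde{y}$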




Recursive Least-Squares 


Now that we have studied in some detail the least-squares problem, we proceed to derive
an algorithm for updating its solution. The resulting recursion will be referred to as
the Recursive Least-Squares (RLS) algorithm and it will form the basis for most of our
discussions in the future chapters.


30.1 MOTIVATION 


Given an $N \times 1$ measurement vector $y$, an $N \times M$ data matrix $H$ and an $M \times M$ positive-definite
matrix $\Pi$, we saw in Sec. 29.7 that the $M \times 1$ solution to the following regularized
least-squares problem:


$$ \min_w \left[ w^*\Pi w + \|y - Hw\|^2 \right] $$   (30.1)


is given by

$$ \widehat{w} = (\Pi + H^*H)^{-1}H^*y $$   (30.2)


where, in comparison with (29.28), we are assuming $\bar{w} = 0$ for simplicity of presentation.
The arguments would apply equally well to the case $\bar{w} \neq 0$ (see the remark after
Lemma 30.1).

We denote the individual entries of $y$ by $\{d(i)\}$, and the individual rows of $H$ by $\{u_i\}$,
say,


$$ y = \begin{bmatrix} d(0) \\ d(1) \\ d(2) \\ \vdots \\ d(N-1) \end{bmatrix}, \qquad H = \begin{bmatrix} u_0 \\ u_1 \\ u_2 \\ \vdots \\ u_{N-1} \end{bmatrix} $$

so that the solution $\widehat{w}$ in (30.2) is determined by data $\{d(i), u_i\}$ defined up to time $N - 1$.
In order to indicate this fact explicitly, we shall write $w_{N-1}$ instead of $\widehat{w}$ from now on,
with a time subscript $(N - 1)$. We shall also write $y_{N-1}$ and $H_{N-1}$ instead of $y$ and $H$
since these quantities are defined in terms of data up to time $N - 1$ as well. With this
notation, we replace problem (30.1) by


$$ \min_w \left[ w^*\Pi w + \|y_{N-1} - H_{N-1}w\|^2 \right] $$   (30.3)

and its solution (30.2) by

$$ w_{N-1} = (\Pi + H_{N-1}^*H_{N-1})^{-1}H_{N-1}^*y_{N-1} $$   (30.4)

In recursive least-squares, we deal with the issue of an increasing $N$, and, hence, of an
increasing amount of data. If, for example, one more row is added to $H_{N-1}$ and one more
entry is added to $y_{N-1}$, leading to


$$ y_N = \begin{bmatrix} y_{N-1} \\ d(N) \end{bmatrix}, \qquad H_N = \begin{bmatrix} H_{N-1} \\ u_N \end{bmatrix} $$   (30.5)


then the solution $w_N$ of the time-updated least-squares problem


$$ \min_w \left[ w^*\Pi w + \|y_N - H_Nw\|^2 \right] $$   (30.6)


would become

$$ w_N = (\Pi + H_N^*H_N)^{-1}H_N^*y_N $$   (30.7)


Going from (30.4) to (30.7) is referred to as a time-update step since it amounts to employing
new data $\{d(N), u_N\}$ in addition to all previous data $\{d(j), u_j,\ 0 \leq j \leq N - 1\}$.

Now, performing time-updates by relying on expressions (30.4) and (30.7) is costly both
computationally and memory-wise. This is because they require that we invert the $M \times M$
coefficient matrices


$$ (\Pi + H_N^*H_N) \quad \text{and} \quad (\Pi + H_{N-1}^*H_{N-1}) $$


at the respective time instants. In addition, (30.7) requires that we store in memory the entries
of $\{H_{N-1}, y_{N-1}\}$ so that $\{H_N, y_N\}$ could be formed when the new data $\{u_N, d(N)\}$
become available. These two requirements of matrix inversion (requiring $O(M^3)$ operations)
and increasing storage capacity can be alleviated by seeking an update method that
would compute $w_N$ solely from knowledge of the new data $\{d(N), u_N\}$ and from the
previous solution $w_{N-1}$. Such a method is possible since, as we see from (30.5), the quantities
$\{H_{N-1}, y_{N-1}\}$ and $\{H_N, y_N\}$ differ only by the new data $\{u_N, d(N)\}$; all other
entries are identical. Exploiting this observation is the basis for deriving the recursive
least-squares algorithm.


30.2 RLS ALGORITHM 


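Introduce the matrices


$$ P_N \triangleq (\Pi + H_N^*H_N)^{-1}, \qquad P_{N-1} \triangleq (\Pi + H_{N-1}^*H_{N-1})^{-1} $$   (30.8)


with initial condition $P_{-1} = \Pi^{-1}$. Then (30.4) and (30.7) can be written more compactly
as

$$ w_{N-1} = P_{N-1}H_{N-1}^*y_{N-1}, \qquad w_N = P_NH_N^*y_N $$   (30.9)


The time-update relation (30.5) between $\{y_N, H_N\}$ and $\{y_{N-1}, H_{N-1}\}$ can be used to
relate $P_N$ to $P_{N-1}$ and $w_N$ to $w_{N-1}$.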


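Derivation
To begin with, note that


$$ \begin{aligned}
P_N^{-1} &= \Pi + H_N^*H_N \\
&= \Pi + H_{N-1}^*H_{N-1} + u_N^*u_N
\end{aligned} $$

so that


$$ P_N^{-1} = P_{N-1}^{-1} + u_N^*u_N $$   (30.10)


Then, by using the matrix inversion identity

$$ (A + BCD)^{-1} = A^{-1} - A^{-1}B(C^{-1} + DA^{-1}B)^{-1}DA^{-1} $$   (30.11)

with the identifications

$$ A \leftarrow P_{N-1}^{-1}, \quad B \leftarrow u_N^*, \quad C \leftarrow 1, \quad D \leftarrow u_N $$


we obtain a recursive formula for updating $P_N$ directly rather than its inverse,


$$ P_N = P_{N-1} - \frac{P_{N-1}u_N^*u_NP_{N-1}}{1 + u_NP_{N-1}u_N^*}, \qquad P_{-1} = \Pi^{-1} $$   (30.12)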


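This recursion for $P_N$ also gives one for updating the least-squares solution $w_N$ itself.
Using expression (30.9) for $w_N$, and substituting the above recursion for $P_N$, we find


$$ \begin{aligned}
w_N &= P_N\left[H_{N-1}^*y_{N-1} + u_N^*d(N)\right] \\
&= \left( P_{N-1} - \frac{P_{N-1}u_N^*u_NP_{N-1}}{1 + u_NP_{N-1}u_N^*} \right) \left[H_{N-1}^*y_{N-1} + u_N^*d(N)\right] \\
&= \underbrace{P_{N-1}H_{N-1}^*y_{N-1}}_{=\,w_{N-1}} - \frac{P_{N-1}u_N^*}{1 + u_NP_{N-1}u_N^*}\, u_N \underbrace{P_{N-1}H_{N-1}^*y_{N-1}}_{=\,w_{N-1}} + P_{N-1}u_N^*\left( 1 - \frac{u_NP_{N-1}u_N^*}{1 + u_NP_{N-1}u_N^*} \right) d(N)
\end{aligned} $$


That is,


$$ w_N = w_{N-1} + \frac{P_{N-1}u_N^*}{1 + u_NP_{N-1}u_N^*}\left[d(N) - u_Nw_{N-1}\right], \qquad w_{-1} = 0 $$   (30.13)
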

We summarize the time-updates for $\{P_N, w_N\}$ in Alg. 30.1 below, where we also introduce
two important quantities. One is called the conversion factor, for reasons to be
explained shortly in Sec. 30.4, and is defined by


$$ \gamma(N) \triangleq 1/(1 + u_NP_{N-1}u_N^*) $$   (30.14)

whereas the other is called the gain vector and is defined by

$$ g_N \triangleq P_{N-1}u_N^*\gamma(N) $$   (30.15)


Some straightforward algebra, using recursion (30.12) for $P_N$, shows that $\{g_N, \gamma(N)\}$ are
also given by

$$ \gamma(N) = 1 - u_NP_Nu_N^* $$   (30.16)

and

$$ g_N = P_Nu_N^* $$   (30.17)


These alternative expressions for $\{g_N, \gamma(N)\}$ are in terms of $P_N$ while the first two expressions
(30.14)-(30.15) are in terms of $P_{N-1}$. To justify (30.16)-(30.17) simply note
the following. Multiplying recursion (30.12) for $P_N$ by $u_N^*$ from the right we get


$$ \begin{aligned}
P_Nu_N^* &= P_{N-1}u_N^* - \frac{P_{N-1}u_N^*u_NP_{N-1}u_N^*}{1 + u_NP_{N-1}u_N^*} \\
&= \frac{P_{N-1}u_N^*}{1 + u_NP_{N-1}u_N^*} \\
&= g_N
\end{aligned} $$


By further multiplying the above identity by $u_N$ from the left we get


$$ u_NP_Nu_N^* = \frac{u_NP_{N-1}u_N^*}{1 + u_NP_{N-1}u_N^*} $$


so that, by subtracting 1 from both sides, we obtain (30.16).


Algorithm 30.1 (RLS algorithm) Given $\Pi > 0$, the solution $w_N$ that minimizes
the cost

$$ w^*\Pi w + \|y_N - H_Nw\|^2 $$


can be computed recursively as follows. Start with $w_{-1} = 0$ and $P_{-1} = \Pi^{-1}$
and iterate for $i \geq 0$:


$$ \begin{aligned}
\gamma(i) &= 1/(1 + u_iP_{i-1}u_i^*) \\
g_i &= P_{i-1}u_i^*\gamma(i) \\
w_i &= w_{i-1} + g_i\left[d(i) - u_iw_{i-1}\right] \\
P_i &= P_{i-1} - g_ig_i^*/\gamma(i)
\end{aligned} $$


At each iteration, it holds that $w_i$ minimizes $w^*\Pi w + \|y_i - H_iw\|^2$, where
$y_i = \mathrm{col}\{d(0), d(1), \ldots, d(i)\}$ and the rows of $H_i$ are $\{u_0, u_1, \ldots, u_i\}$. Moreover,
$P_i = (\Pi + H_i^*H_i)^{-1}$.
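
The recursions of Alg. 30.1 can be sketched in a few lines. The code below is a minimal illustration, assuming NumPy and complex-valued synthetic data; the function name `rls`, the data sizes, and the value of $\Pi$ are hypothetical. It compares the recursive solution against the batch expression (30.7).

```python
import numpy as np


def rls(U, d, Pi):
    """Sketch of the RLS recursions; the rows of U are the regressors u_i."""
    M = U.shape[1]
    w = np.zeros(M, dtype=complex)              # w_{-1} = 0
    P = np.linalg.inv(Pi).astype(complex)       # P_{-1} = Pi^{-1}
    for u, di in zip(U, d):
        Pu = P @ u.conj()                       # P_{i-1} u_i^*
        gamma = 1.0 / (1.0 + (u @ Pu).real)     # conversion factor (30.14)
        g = gamma * Pu                          # gain vector (30.15)
        w = w + g * (di - u @ w)                # update by the a priori error
        P = P - np.outer(g, g.conj()) / gamma   # P_i = P_{i-1} - g_i g_i^* / gamma(i)
    return w, P


rng = np.random.default_rng(8)
steps, M = 30, 4
U = rng.standard_normal((steps, M)) + 1j * rng.standard_normal((steps, M))
d = rng.standard_normal(steps) + 1j * rng.standard_normal(steps)
Pi = 0.2 * np.eye(M)

w_rls, P_rls = rls(U, d, Pi)

# Batch solution (30.7) computed from all the data at once.
Uh = U.conj().T
w_batch = np.linalg.solve(Pi + Uh @ U, Uh @ d)

print("max deviation from batch solution:", np.max(np.abs(w_rls - w_batch)))
```

Each iteration costs $O(M^2)$ operations and only the current pair $\{w_i, P_i\}$ is stored, in contrast with the batch solution, which refers to the entire data record.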


30.3 REGULARIZATION 


Observe that regularization is necessary (i.e., the use of $\Pi > 0$), since it guarantees the
existence of $P_N = (\Pi + H_N^*H_N)^{-1}$. In the absence of regularization, i.e., when $\Pi = 0$,
the above inverse need not exist, especially during the initial update stages when $H_N$ has
fewer rows than columns or even at later stages if $H_N$ becomes rank deficient. In these
situations, the RLS recursions of Alg. 30.1 would not be applicable, not only because the
matrix $P_N$ is not defined but also because of the non-practical initialization $P_{-1} = \infty I$.

In Sec. 35.2 we shall describe an alternative implementation of RLS that addresses these
difficulties; it can be used even in the absence of regularization and has better numerical
properties than RLS itself. In preparation for the derivation of this alternative algorithm
later in Sec. 35.2, we indicate here some of the RLS equations that will be needed in that
section.

Thus, note that expression (30.7) for $w_N$ is simply a rewriting of the normal equations


$$ (\Pi + H_N^*H_N)w_N = H_N^*y_N $$   (30.18)

which characterize the solution of the regularized least-squares problem (30.6). The form
(30.18) is more general than (30.7) since it holds even when $(\Pi + H_N^*H_N)$ is singular (e.g.,
when $\Pi = 0$ and $H_N$ is rank deficient); recall (29.33) and the discussion thereafter.
Now if we define the $M \times M$ matrix $\Phi_N$ and the $M \times 1$ vector $s_N$,


$$ \Phi_N \triangleq \Pi + H_N^*H_N, \qquad s_N \triangleq H_N^*y_N $$


then, just like the recursion (30.10) for $P_N^{-1}$, we also obtain a recursion for $\Phi_N$:


$$ \Phi_N = \Phi_{N-1} + u_N^*u_N, \qquad \Phi_{-1} = \Pi $$   (30.19)


and using (30.5), it is immediate to see that $s_N$ satisfies the time-update recursion:

$$ s_N = s_{N-1} + u_N^*d(N), \qquad s_{-1} = 0 $$   (30.20)


With $\{\Phi_N, s_N\}$ so defined, the normal equations (30.18) can be rewritten as


$$ \Phi_Nw_N = s_N $$   (30.21)


The point is that this description is valid even when $\Pi = 0$ since, as we already know from
Thm. 29.1, the corresponding normal equations

$$ \underbrace{H_N^*H_N}_{\Phi_N}\, w_N = \underbrace{H_N^*y_N}_{s_N} $$

are consistent so that a solution $w_N$ can always be found (e.g., one could choose the
minimum-norm solution when multiple solutions exist). The equations (30.19)-(30.21)
will form the basis for the aforementioned alternative implementation of RLS in Sec. 35.2;
one that has better properties in finite-precision arithmetic and can be used even if $\Pi = 0$.
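
The recursions (30.19)-(30.21) can be exercised directly with $\Pi = 0$. The sketch below is only an illustration of these three equations and is not the algorithm of Sec. 35.2; it assumes NumPy, real-valued synthetic data, and picks the minimum-norm solution of the (possibly singular) normal equations through a generic solver with an illustrative rank threshold.

```python
import numpy as np

rng = np.random.default_rng(6)
steps, M = 6, 3
U = rng.standard_normal((steps, M))      # rows are the regressors u_i (real data)
d = rng.standard_normal(steps)

Phi = np.zeros((M, M))                   # Phi_{-1} = Pi = 0 (no regularization)
s = np.zeros(M)                          # s_{-1} = 0
sols = []
for u, di in zip(U, d):
    Phi = Phi + np.outer(u, u)           # (30.19): Phi_N = Phi_{N-1} + u_N^* u_N
    s = s + u * di                       # (30.20): s_N = s_{N-1} + u_N^* d(N)
    # (30.21): Phi_N w_N = s_N, solved in the minimum-norm sense when Phi_N is singular.
    w, *_ = np.linalg.lstsq(Phi, s, rcond=1e-10)
    sols.append(w)

print("first iterate (Phi has rank 1):", sols[0])
print("last iterate:", sols[-1])
```

During the first iterations $\Phi_N$ has fewer than $M$ independent rows, so $P_N$ would not exist, yet the normal equations remain consistent and a solution can still be extracted.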


30.4 CONVERSION FACTOR 


We can also derive a time-update relation for the minimum costs associated with problems
(30.3) and (30.6). To do so, we first associate with RLS two error quantities: the a priori
output error, denoted by $e(N)$,


$$ e(N) \triangleq d(N) - u_Nw_{N-1} $$   (30.22)

and the a posteriori output error, denoted by $r(N)$,

$$ r(N) \triangleq d(N) - u_Nw_N $$   (30.23)


It then turns out that the factor $\gamma(N)$ defined by (30.14) serves a useful purpose: it maps
$e(N)$ to $r(N)$, i.e., it converts the a priori error into its a posteriori version. To see this,
we replace $w_N$ in the definition of $r(N)$ by its RLS update from Alg. 30.1 to obtain


$$ \begin{aligned}
r(N) &= d(N) - u_N\left( w_{N-1} + g_N\left[d(N) - u_Nw_{N-1}\right] \right) \\
&= d(N) - u_Nw_{N-1} - u_Ng_Ne(N) \\
&= e(N) - u_Ng_Ne(N) \\
&= (1 - u_Ng_N)\,e(N)
\end{aligned} $$


Using (30.16) and (30.17) we get

$$ r(N) = \gamma(N)e(N) $$


as desired. Observe further from the definition of $\gamma(N)$ in (30.14) that


$$ 0 < \gamma(N) \leq 1 $$


so that it always holds that


$$ |r(N)| \leq |e(N)| $$

30.5 TIME-UPDATE OF THE MINIMUM COST 


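Now let $\xi(N - 1)$ denote the minimum cost of the regularized least-squares problem (30.3).
Likewise, let $\xi(N)$ denote the minimum cost of the time-updated problem (30.6). From
Thm. 29.4 we know that $\xi(N - 1)$ and $\xi(N)$ are given by


$$ \xi(N) = y_N^*\left[y_N - H_Nw_N\right], \qquad \xi(N - 1) = y_{N-1}^*\left[y_{N-1} - H_{N-1}w_{N-1}\right] $$

Using the partitioning (30.5) for $\{y_N, H_N\}$, as well as the RLS update

$$ w_N = w_{N-1} + g_Ne(N), \qquad e(N) = d(N) - u_Nw_{N-1} $$


we can arrive at a relation between $\xi(N)$ and $\xi(N - 1)$ by means of the following sequence
of calculations:


$$ \begin{aligned}
\xi(N) &= \begin{bmatrix} y_{N-1}^* & d^*(N) \end{bmatrix} \begin{bmatrix} y_{N-1} - H_{N-1}\left[w_{N-1} + g_Ne(N)\right] \\ d(N) - u_N\left[w_{N-1} + g_Ne(N)\right] \end{bmatrix} \\
&= \underbrace{y_{N-1}^*(y_{N-1} - H_{N-1}w_{N-1})}_{=\,\xi(N-1)} - y_{N-1}^*H_{N-1}g_Ne(N) + d^*(N)e(N)\left[1 - u_Ng_N\right] \\
&= \xi(N - 1) + e(N)\Big[ d^*(N)\underbrace{(1 - u_Ng_N)}_{=\,\gamma(N)} - y_{N-1}^*H_{N-1}g_N \Big] \\
&= \xi(N - 1) + e(N)\gamma(N)\Big[ d^*(N) - \underbrace{y_{N-1}^*H_{N-1}P_{N-1}}_{=\,w_{N-1}^*} u_N^* \Big] \\
&= \xi(N - 1) + |e(N)|^2\gamma(N)
\end{aligned} $$


Moreover, since $r(N) = \gamma(N)e(N)$, it holds that

$$ |e(N)|^2\gamma(N) = e(N)r^*(N) = e^*(N)r(N) = |r(N)|^2/\gamma(N) $$   (30.24)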


In summary, we arrive at the following conclusion.

Lemma 30.1 (Estimation errors) Consider the same setting of Alg. 30.1. At
each iteration $i$, the a priori and a posteriori estimation errors defined by


$$ e(i) = d(i) - u_iw_{i-1} \quad \text{and} \quad r(i) = d(i) - u_iw_i $$


are related via the conversion factor $\gamma(i)$ as $r(i) = \gamma(i)e(i)$. Moreover, the
minimum costs of the successive regularized least-squares problems satisfy
any of the following time-update relations with initial condition $\xi(-1) = 0$:


$$ \begin{aligned}
\xi(i) &= \xi(i - 1) + e(i)r^*(i) \\
&= \xi(i - 1) + \gamma(i)|e(i)|^2 \\
&= \xi(i - 1) + |r(i)|^2/\gamma(i)
\end{aligned} $$

One final remark is in order. Had we considered a regularized cost function of the form


$$ \min_w \left[ (w - \bar{w})^*\Pi(w - \bar{w}) + \|y_N - H_Nw\|^2 \right] $$


with a nonzero $\bar{w}$, then the only modification to the RLS algorithm would be in the value
of its initial condition, $w_{-1}$; it becomes $w_{-1} = \bar{w}$. Moreover, the time-update for $\xi(N)$
will still hold. To see this, we only need to repeat the above derivation for $\xi(N)$ starting
from the expression (cf. Table 29.3 and Prob. VII.29):


$$ \xi(N) = \left[y_N - H_N\bar{w}\right]^*\left[y_N - H_Nw_N\right] $$


30.6 EXPONENTIALLY-WEIGHTED RLS ALGORITHM 


It is more common in adaptive filtering to employ a weighted regularized least-squares
cost function, as opposed to the unweighted cost in (30.6). More specifically, a diagonal
weighting matrix is used whose purpose is to give more weight to recent data and less
weight to data from the remote past.

Let $\lambda$ be a positive scalar, usually very close to one (e.g., $\lambda = 0.998$ or some similar
value), say, $0 \ll \lambda \leq 1$, and introduce the diagonal matrix


$$ \Lambda_N \triangleq \mathrm{diag}\{\lambda^N, \lambda^{N-1}, \ldots, \lambda, 1\} $$   (30.25)

Then replace (30.6) by

$$ \min_w \left[ \lambda^{(N+1)}w^*\Pi w + (y_N - H_Nw)^*\Lambda_N(y_N - H_Nw) \right] $$   (30.26)


or, more explicitly, by


$$ \min_w \left[ \lambda^{(N+1)}w^*\Pi w + \sum_{j=0}^{N} \lambda^{N-j}|d(j) - u_jw|^2 \right] $$   (30.27)


The scalar $\lambda$ is called the forgetting factor since past data are exponentially weighted less
heavily than more recent data. The special case $\lambda = 1$ is known as the growing memory
case and it was studied in the previous sections. Exponential weighting is one form of data
windowing whereby the effective length of the window is approximately $1/(1 - \lambda)$ samples.

Observe that the regularization matrix in (30.26) is chosen as $\lambda^{(N+1)}\Pi$, with the additional
scaling factor $\lambda^{(N+1)}$. Since this factor becomes smaller as time progresses, we see
that the exponentially-weighted cost function (30.26) is such that it de-emphasizes regularization
during the later stages of operation when the data matrix $H_N$ is more likely to
have full rank.

Comparing (30.26) with (29.37), and using the identifications II — A+) Tl and W — 
An, we find that the solution wy is obtained by solving 


wen + Hx Hy] wy = HyAnyn (30.28) 


If we now define the quantities 


-1 
Py ê [janon + HyAvHy| 
gN E A71 Py uM (N) 
(N) Ê 1/(1+A7!uyPy_1u',) 


and repeat the arguments prior to the statement of Alg. 30.1, we arrive at the following 
statement. 


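Algorithm 30.2 (Exponentially-weighted RLS) Given $\Pi > 0$, and a forgetting
factor $0 \ll \lambda \le 1$, the solution $w_N$ of the exponentially-weighted regularized
least-squares problem (30.27), and the corresponding minimum cost $\xi(N)$,
can be computed recursively as follows. Start with $w_{-1} = 0$, $P_{-1} = \Pi^{-1}$,
and $\xi(-1) = 0$, and iterate for $i \ge 0$:

\[
\begin{aligned}
\gamma(i) &= 1/(1 + \lambda^{-1}u_i P_{i-1}u_i^*) \\
g_i &= \lambda^{-1}P_{i-1}u_i^*\gamma(i) \\
e(i) &= d(i) - u_i w_{i-1} \\
w_i &= w_{i-1} + g_i e(i) \\
P_i &= \lambda^{-1}P_{i-1} - g_i g_i^*/\gamma(i) \\
\xi(i) &= \lambda\xi(i-1) + \gamma(i)|e(i)|^2
\end{aligned}
\]

At each iteration, $P_i$ has the interpretation $P_i = [\lambda^{(i+1)}\Pi + H_i^*\Lambda_i H_i]^{-1}$ and
$w_i$ is the solution of

\[
\min_w \left[ \lambda^{(i+1)} w^*\Pi w + \sum_{j=0}^{i} \lambda^{i-j}|d(j) - u_j w|^2 \right]
\]

In addition, as was the case with (30.16)-(30.17), the following relations hold:

\[
g_i = P_i u_i^*, \qquad \gamma(i) = 1 - u_i P_i u_i^* = 1 - u_i g_i, \qquad r(i) = \gamma(i)e(i)
\]

where $r(i) = d(i) - u_i w_i$.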
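
The recursions of Alg. 30.2 can be checked numerically against a direct solution of the normal equations (30.28). The following is a minimal sketch, assuming real-valued data (so that $^*$ reduces to transposition) and NumPy; the function names `ew_rls` and `batch_solution`, the test data, and the values $\lambda = 0.98$ and $\Pi = 0.1I$ are illustrative choices and not part of the text.

```python
import numpy as np

def ew_rls(U, d, lam, Pi):
    """Run the recursions of Alg. 30.2 on real data; row i of U is the regressor u_i."""
    w = np.zeros(U.shape[1])          # w_{-1} = 0
    P = np.linalg.inv(Pi)             # P_{-1} = Pi^{-1}
    xi = 0.0                          # xi(-1) = 0
    for u, d_i in zip(U, d):
        Pu = P @ u
        gamma = 1.0 / (1.0 + (u @ Pu) / lam)   # conversion factor gamma(i)
        g = Pu * (gamma / lam)                 # gain vector g_i
        e = d_i - u @ w                        # a priori error e(i)
        w = w + g * e
        P = P / lam - np.outer(g, g) / gamma
        xi = lam * xi + gamma * e ** 2         # minimum cost xi(i)
    return w, P, xi

def batch_solution(U, d, lam, Pi):
    """Solve (30.28) directly and evaluate the cost (30.27) at the solution."""
    N = len(d) - 1
    weights = lam ** (N - np.arange(N + 1))    # lam^{N-j} for j = 0..N
    Phi = lam ** (N + 1) * Pi + U.T @ (weights[:, None] * U)
    w = np.linalg.solve(Phi, U.T @ (weights * d))
    cost = lam ** (N + 1) * (w @ Pi @ w) + np.sum(weights * (d - U @ w) ** 2)
    return w, Phi, cost

# Illustrative data (assumed, not from the text).
rng = np.random.default_rng(0)
U = rng.standard_normal((50, 3))
d = U @ np.array([1.0, -0.5, 0.25]) + 0.01 * rng.standard_normal(50)
Pi = 0.1 * np.eye(3)

w_rec, P_rec, xi_rec = ew_rls(U, d, 0.98, Pi)
w_bat, Phi_bat, cost_bat = batch_solution(U, d, 0.98, Pi)
```

Under these assumptions, `w_rec` should agree with `w_bat`, `P_rec` should be the inverse of `Phi_bat`, and `xi_rec` should match `cost_bat` up to rounding.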
Again, starting from the normal equations (30.28), we can define the quantities

\[
\Phi_N \triangleq \lambda^{(N+1)}\Pi + H_N^*\Lambda_N H_N \tag{30.29}
\]
\[
s_N \triangleq H_N^*\Lambda_N y_N \tag{30.30}
\]

Then it can be easily verified that they satisfy the recursions

\[
\Phi_N = \lambda\Phi_{N-1} + u_N^* u_N, \qquad \Phi_{-1} = \Pi \tag{30.31}
\]
\[
s_N = \lambda s_{N-1} + u_N^* d(N), \qquad s_{-1} = 0 \tag{30.32}
\]

and that $w_N$ can be found by solving the normal equations

\[
\Phi_N w_N = s_N \tag{30.33}
\]
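
As a quick numerical illustration of (30.29)-(30.33), the sketch below runs the two recursions and compares the result with the direct definitions. It assumes real-valued data and NumPy; the helper name `update_normal_equations` and the random test data are hypothetical.

```python
import numpy as np

def update_normal_equations(Phi, s, u, d_i, lam):
    """One step of (30.31)-(30.32) for real data; returns the updated (Phi, s)."""
    return lam * Phi + np.outer(u, u), lam * s + u * d_i

# Illustrative data (assumed, not from the text).
rng = np.random.default_rng(1)
lam, M, N = 0.95, 3, 20
U = rng.standard_normal((N + 1, M))
d = rng.standard_normal(N + 1)
Pi = 0.5 * np.eye(M)

Phi, s = Pi.copy(), np.zeros(M)            # Phi_{-1} = Pi, s_{-1} = 0
for u, d_i in zip(U, d):
    Phi, s = update_normal_equations(Phi, s, u, d_i, lam)

# Direct evaluation of the definitions (30.29)-(30.30).
Lam = np.diag(lam ** (N - np.arange(N + 1)))
Phi_direct = lam ** (N + 1) * Pi + U.T @ Lam @ U
s_direct = U.T @ Lam @ d

w_N = np.linalg.solve(Phi, s)              # normal equations (30.33)
```

Solving (30.33) at every step costs more than the gain-based update of Alg. 30.2, which is one reason the recursion for $P_i$ is usually preferred in practice.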


As mentioned in Sec. 30.3, these equations will be used in Sec. 35.2 to motivate an alternative
recursive implementation of the exponentially-weighted RLS algorithm.
Moreover, had we started with a regularized cost function of the form

\[
\min_w \left[ \lambda^{(N+1)}(w - \bar{w})^*\Pi(w - \bar{w}) + \sum_{j=0}^{N} \lambda^{N-j}|d(j) - u_j w|^2 \right]
\]

with a nonzero $\bar{w}$, then the only modification to the RLS equations of Alg. 30.2 would
be in the value of the initial condition, $w_{-1}$; it becomes $w_{-1} = \bar{w}$. Likewise, equations
(30.31)-(30.33) would be replaced by

\[
\Phi_N = \lambda\Phi_{N-1} + u_N^* u_N, \qquad \Phi_{-1} = \Pi \tag{30.34}
\]
\[
s_N = \lambda s_{N-1} + u_N^*[d(N) - u_N\bar{w}], \qquad s_{-1} = 0 \tag{30.35}
\]
\[
\Phi_N(w_N - \bar{w}) = s_N \tag{30.36}
\]


CHAPTER 31


Kalman Filtering and RLS


There is a close relation between regularized least-squares problems, as studied in the
previous chapter, and linear least-mean-squares estimation problems, as studied in Part 
II (Linear Estimation). Although the former class of problems deals with deterministic 
variables and the latter class of problems deals with random variables, both classes turn out 
to be equivalent in the sense that solving a problem from one class also solves a problem 
from the other class and vice-versa. 


31.1 EQUIVALENCE IN LINEAR ESTIMATION 


Stochastic Problem 
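Let $\mathbf{x}$ and $\mathbf{y}$ be zero-mean random variables that are related via a linear model of the form

\[
\mathbf{y} = H\mathbf{x} + \mathbf{v} \tag{31.1}
\]

for some known matrix $H$ and where $\mathbf{v}$ denotes a zero-mean random noise vector with
known covariance matrix, say, $R_v = \mathsf{E}\,\mathbf{v}\mathbf{v}^*$. The covariance matrix of $\mathbf{x}$ is also known
and denoted by $\mathsf{E}\,\mathbf{x}\mathbf{x}^* = R_x$. The variables $\{\mathbf{x}, \mathbf{v}\}$ are uncorrelated, i.e., $\mathsf{E}\,\mathbf{x}\mathbf{v}^* = 0$, and we
further assume that $R_x > 0$ and $R_v > 0$. We established in Thm. 5.1 that the linear
least-mean-squares estimator of $\mathbf{x}$ given $\mathbf{y}$ is

\[
\hat{\mathbf{x}} = \left[ R_x^{-1} + H^*R_v^{-1}H \right]^{-1} H^*R_v^{-1}\mathbf{y} \tag{31.2}
\]

and that the resulting minimum mean-square error matrix is

\[
\text{m.m.s.e.} = \left[ R_x^{-1} + H^*R_v^{-1}H \right]^{-1} \tag{31.3}
\]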


Deterministic Problem 
Now consider instead deterministic variables $\{x, y\}$ and a data matrix $H$ relating them via

\[
y = Hx + v \tag{31.4}
\]

where $v$ denotes measurement noise. Assume further that we pose the problem of estimating
$x$ by solving the weighted regularized least-squares problem:

\[
\min_x \left[ x^*\Pi x + \|y - Hx\|_W^2 \right] \tag{31.5}
\]

where $\Pi > 0$ is a regularization matrix and $W > 0$ is a weighting matrix. We showed in
Thm. 29.5 that the solution $\hat{x}$ is given by

\[
\hat{x} = \left[ \Pi + H^*WH \right]^{-1} H^*Wy \tag{31.6}
\]

and that the resulting minimum cost is

\[
\xi = y^*\left[ W^{-1} + H\Pi^{-1}H^* \right]^{-1} y \tag{31.7}
\]

Equivalence 

Expression (31.2) provides the linear least-mean-squares estimator of $\mathbf{x}$ in a stochastic
framework, while expression (31.6) provides the least-squares estimate of $x$ in the deterministic
framework (31.5). Still, it is clear that if we replace the quantities in (31.2) by
$R_x \leftarrow \Pi^{-1}$ and $R_v \leftarrow W^{-1}$, then the stochastic solution (31.2) would coincide with
the deterministic solution (31.6). We therefore say that both problems are equivalent. Such
equivalences play a central role in estimation theory since they allow us to move back and 
forth between deterministic and stochastic formulations, and to determine the solution for 
one context from the solution to the other. Table 31.1 summarizes the relations between 
the variables in both contexts. We now illustrate an application of this important result in 
the context of adaptive filtering. 


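TABLE 31.1  Equivalence of the stochastic and deterministic frameworks.


Stochastic                                                                  Deterministic
Random variables $\{\mathbf{x}, \mathbf{y}\}$                               Deterministic variables $\{x, y\}$
Model $\mathbf{y} = H\mathbf{x} + \mathbf{v}$                               Model $y = Hx + v$
Covariance matrix $R_x$                                                     Inverse regularization matrix $\Pi^{-1}$
Noise covariance $R_v$                                                      Inverse weighting matrix $W^{-1}$
$\hat{\mathbf{x}}$                                                          $\hat{x}$
$\min_K \mathsf{E}(\mathbf{x} - K\mathbf{y})(\mathbf{x} - K\mathbf{y})^*$   $\min_x [x^*\Pi x + \|y - Hx\|_W^2]$
$\hat{\mathbf{x}} = [R_x^{-1} + H^*R_v^{-1}H]^{-1}H^*R_v^{-1}\mathbf{y}$    $\hat{x} = [\Pi + H^*WH]^{-1}H^*Wy$
m.m.s.e. $= [R_x^{-1} + H^*R_v^{-1}H]^{-1}$                                 Min. cost $= y^*[W^{-1} + H\Pi^{-1}H^*]^{-1}y$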
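
The correspondences of Table 31.1 can be verified numerically. The sketch below, which assumes real-valued data and NumPy, computes the deterministic solution (31.6) and compares it with the stochastic estimator evaluated for $R_x = \Pi^{-1}$ and $R_v = W^{-1}$. To make the comparison non-trivial, the stochastic estimator is computed in the covariance form $R_xH^*(R_v + HR_xH^*)^{-1}y$, a standard equivalent of (31.2) that follows from the matrix inversion formula and is an addition not stated in this section. The helper `random_spd` and all data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n_obs, n_par = 6, 3
H = rng.standard_normal((n_obs, n_par))
y = rng.standard_normal(n_obs)

def random_spd(k):
    """Return a random symmetric positive-definite k x k matrix (illustrative helper)."""
    A = rng.standard_normal((k, k))
    return A @ A.T + k * np.eye(k)

Pi, W = random_spd(n_par), random_spd(n_obs)

# Deterministic side: solution (31.6), cost evaluated directly and via (31.7).
x_det = np.linalg.solve(Pi + H.T @ W @ H, H.T @ W @ y)
residual = y - H @ x_det
cost_direct = x_det @ Pi @ x_det + residual @ W @ residual
cost_formula = y @ np.linalg.solve(np.linalg.inv(W) + H @ np.linalg.inv(Pi) @ H.T, y)

# Stochastic side with R_x = Pi^{-1} and R_v = W^{-1}, in covariance form.
Rx, Rv = np.linalg.inv(Pi), np.linalg.inv(W)
x_sto = Rx @ H.T @ np.linalg.solve(Rv + H @ Rx @ H.T, y)
```

Under these identifications, `x_det` and `x_sto` should coincide, and `cost_direct` should match `cost_formula`.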


31.2 KALMAN FILTERING AND RECURSIVE LEAST-SQUARES 


The equivalence established in Table 31.1 between stochastic and deterministic least-squares 
problems can be used to clarify the relationship that exists between Kalman filtering and 
RLS algorithms. This relationship is useful for at least two reasons: 


1. Kalman filtering theory is well studied and many algorithmic variants have been 
developed over the years. Therefore, by establishing a connection between Kalman 
filters and RLS filters, it becomes possible to share ideas and algorithms between 
both domains. 


2. The RLS filter is not equivalent to a full-blown Kalman filter, but only to a special 
case of it. This fact suggests that extended RLS schemes can be developed with 
enhanced tracking abilities, and we shall pursue this extension in Sec. 31.A. 


In this section, we limit ourselves to showing how RLS is equivalent to a special case 
of the Kalman filter. The arguments presented here are also applicable to vector-valued 
observations $\mathbf{y}_i$, in which case we would be able to establish the relation between Kalman
filtering and the block RLS filter of Prob. VII.36 — see Sec. 31.A. However, for simplicity 
of presentation, we focus on the case of scalar-valued observations. In addition, other state- 
space models (i.e., other than (31.8) below) could be used for the same purpose of relating 
Kalman and RLS filters — see, e.g., (31.41) and (31.40) and the footnote following them. 

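Thus, consider a collection of zero-mean scalar-valued random variables $\{\mathbf{y}(i),\ 0 \le i \le N\}$
that satisfy a special state-space model of the form:

\[
\mathbf{x}_{i+1} = \lambda^{-1/2}\mathbf{x}_i, \qquad \mathbf{y}(i) = u_i\mathbf{x}_i + \mathbf{v}(i) \tag{31.8}
\]

with

\[
\mathsf{E}\,\mathbf{v}(i)\mathbf{v}^*(j) = \delta_{ij}, \qquad \mathsf{E}\,\mathbf{x}_0\mathbf{v}^*(i) = 0, \qquad \mathsf{E}\,\mathbf{x}_0 = 0, \qquad \mathsf{E}\,\mathbf{x}_0\mathbf{x}_0^* = \Pi_0 \tag{31.9}
\]

This model is a special case of the general state-space model (7.8)-(7.10) and it corresponds
to the choices

\[
F_i = \lambda^{-1/2}I, \qquad G_i = 0, \qquad R_i = 1, \qquad H_i = u_i \ \ \text{(a row vector)}
\]

where $\lambda$ is a positive scalar less than or equal to one ($0 \ll \lambda \le 1$). From Alg. 7.1 we know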
that the corresponding Kalman filtering equations are given by 


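\[
\begin{aligned}
r_e(i) &= 1 + u_i P_{i|i-1} u_i^* \\
k_{p,i} &= \lambda^{-1/2} P_{i|i-1} u_i^* / r_e(i) \\
\nu(i) &= \mathbf{y}(i) - u_i\hat{\mathbf{x}}_{i|i-1} \\
\hat{\mathbf{x}}_{i+1|i} &= \lambda^{-1/2}\hat{\mathbf{x}}_{i|i-1} + k_{p,i}\nu(i), \qquad \hat{\mathbf{x}}_{0|-1} = 0 \\
P_{i+1|i} &= \lambda^{-1}\left[ P_{i|i-1} - \frac{P_{i|i-1}u_i^* u_i P_{i|i-1}}{1 + u_i P_{i|i-1} u_i^*} \right], \qquad P_{0|-1} = \Pi_0
\end{aligned}
\tag{31.10}
\]

where the notation $\hat{\mathbf{x}}_{i+1|i}$ denotes the linear least-mean-squares (l.l.m.s.) estimator of $\mathbf{x}_{i+1}$
using the observations $\{\mathbf{y}(0), \mathbf{y}(1), \ldots, \mathbf{y}(i)\}$. For convenience of exposition, we are denoting
the innovations variable of the Kalman filter by $\nu(i)$, as opposed to the symbol $e(i)$ used
in Chapter 7. This is done here in order to avoid a conflict of notation with the symbol $e(i)$
used to denote the output error of RLS.

Actually since, by virtue of (31.8), $\mathbf{x}_{i+1}$ is a scaled version of $\mathbf{x}_0$, we end up obtaining
the estimator of $\mathbf{x}_0$ from these same observations. Indeed, observe by iterating (31.8) that
$\mathbf{x}_0 = \lambda^{(i+1)/2}\mathbf{x}_{i+1}$, so that

\[
\hat{\mathbf{x}}_{0|i} = \lambda^{(i+1)/2}\hat{\mathbf{x}}_{i+1|i} = \text{l.l.m.s.e. of } \mathbf{x}_0 \text{ given } \{\mathbf{y}(0), \mathbf{y}(1), \ldots, \mathbf{y}(i)\}
\]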

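We therefore conclude that the Kalman filtering equations (31.10) allow us to solve the
stochastic problem of estimating $\mathbf{x}_0$ from the observations $\mathbf{y} = \mathrm{col}\{\mathbf{y}(0), \ldots, \mathbf{y}(i)\}$, for
any $i$. If we run recursions (31.10) from $i = 0$ to $i = N$, then we end up with the
l.l.m.s. estimator of $\mathbf{x}_0$ given the $N+1$ observations $\{\mathbf{y}(0), \mathbf{y}(1), \ldots, \mathbf{y}(N)\}$:

\[
\hat{\mathbf{x}}_{0|N} = \lambda^{(N+1)/2}\hat{\mathbf{x}}_{N+1|N} = \text{l.l.m.s.e. of } \mathbf{x}_0 \text{ given } \mathbf{y} = \mathrm{col}\{\mathbf{y}(0), \mathbf{y}(1), \ldots, \mathbf{y}(N)\}
\]

Observe further that, in view of the state-space model (31.8), the variables $\{\mathbf{x}_0, \mathbf{y}\}$ are
related via the linear model:

\[
\underbrace{\begin{bmatrix} \mathbf{y}(0) \\ \mathbf{y}(1) \\ \mathbf{y}(2) \\ \vdots \\ \mathbf{y}(N) \end{bmatrix}}_{\triangleq\, \mathbf{y}}
=
\underbrace{\begin{bmatrix} 1 & & & & \\ & \frac{1}{\sqrt{\lambda}} & & & \\ & & \frac{1}{\lambda} & & \\ & & & \ddots & \\ & & & & \frac{1}{(\sqrt{\lambda})^N} \end{bmatrix}
\begin{bmatrix} u_0 \\ u_1 \\ u_2 \\ \vdots \\ u_N \end{bmatrix}}_{\triangleq\, H}
\mathbf{x}_0 +
\underbrace{\begin{bmatrix} \mathbf{v}(0) \\ \mathbf{v}(1) \\ \mathbf{v}(2) \\ \vdots \\ \mathbf{v}(N) \end{bmatrix}}_{\triangleq\, \mathbf{v}}
\tag{31.11}
\]

that is,

\[
\mathbf{y} = H\mathbf{x}_0 + \mathbf{v}
\]

where we are denoting the matrix multiplying $\mathbf{x}_0$ by $H$.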

Now we can refer to the equivalence result of Table 31.1 in order to characterize the
deterministic problem that corresponds to the stochastic problem of estimating $\mathbf{x}_0$ from
$\mathbf{y}$ via recursions (31.10). Indeed, from Table 31.1 we find that the desired deterministic
problem is

\[
\min_{x_0} \left[ x_0^*\Pi_0^{-1}x_0 + \|y - Hx_0\|^2 \right]
\]

which we can rewrite as

\[
\min_{x_0} \left[ x_0^*\Pi_0^{-1}x_0 + \sum_{j=0}^{N} |y(j) - u_j x_j|^2 \right] \tag{31.12}
\]

for variables $x_j$ satisfying $x_{j+1} = \lambda^{-1/2}x_j$. In summary, we arrive at the conclusion
stated in Lemma 31.1 further ahead.


Relation to Exponentially-Weighted RLS 

The point to stress here is that we have arrived at a recursive solution to the determinis- 
tic least-squares problem (31.12) by appealing to equivalence with the stochastic solution 
(31.10)-(31.11) and not by solving it afresh. We can take this argument a step further and 
show that the recursive solution of Lemma 31.1 is in effect the exponentially-weighted RLS 
filter of Alg. 30.2. Once this is done, our arguments would have clarified the connection 
between Kalman filtering and RLS filtering, namely, that the exponentially-weighted RLS 
problem follows from applying the Kalman filter to the special state-space model (31.8). 

To see this, let us consider the regularized least-squares problem (30.27), namely,

\[
\min_w \left[ \lambda^{(N+1)} w^*\Pi w + \sum_{j=0}^{N} \lambda^{N-j}|d(j) - u_j w|^2 \right] \tag{31.13}
\]

where $\Pi > 0$ is a regularization matrix. We denote its solution by $w_N$; it is the estimate for
$w$ that is based on the data $\{d(j), u_j\}$ between times $j = 0$ and $j = N$. Although at first
sight, this cost function is not of the same form as the cost function appearing in (31.12), it
can be reworked into that form with a suitable change of variables. Thus, observe first that
solving (31.13) is equivalent to solving

\[
\min_w \left[ w^*(\lambda\Pi)w + \sum_{j=0}^{N} \lambda^{-j}|d(j) - u_j w|^2 \right] \tag{31.14}
\]

where we have extracted the constant factor $\lambda^N$. Observe further that the cost function in
(31.14) can be rewritten as

\[
\begin{aligned}
J(w) &= w^*(\lambda\Pi)w + \sum_{j=0}^{N} \left| \frac{d(j)}{(\sqrt{\lambda})^j} - u_j\frac{w}{(\sqrt{\lambda})^j} \right|^2 \\
     &= x_0^*(\lambda\Pi)x_0 + \sum_{j=0}^{N} |y(j) - u_j x_j|^2
\end{aligned}
\tag{31.15}
\]

where we defined the quantities

\[
y(j) \triangleq \frac{d(j)}{(\sqrt{\lambda})^j}, \qquad x_j \triangleq \frac{w}{(\sqrt{\lambda})^j}, \qquad x_0 = w \tag{31.16}
\]

Now it follows from the definition of $x_j$ in (31.16) that it satisfies $x_{j+1} = \lambda^{-1/2}x_j$.
Therefore, solving (31.13) is equivalent to solving

\[
\min_{x_0} \left[ x_0^*(\lambda\Pi)x_0 + \sum_{j=0}^{N} |y(j) - u_j x_j|^2 \right] \quad \text{subject to} \quad x_{j+1} = \lambda^{-1/2}x_j
\]

This problem is now of the same form as (31.12) with the identification $\Pi_0^{-1} = \lambda\Pi$ and,
therefore, its solution can be computed by appealing to the recursions of Lemma 31.1,
namely, we start with $P_{0|-1} = \lambda^{-1}\Pi^{-1}$, $\hat{x}_{0|-1} = 0$ and repeat for $i \ge 0$:


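\[
r_e(i) = 1 + u_i P_{i|i-1} u_i^* \tag{31.17}
\]
\[
k_{p,i} = \lambda^{-1/2} P_{i|i-1} u_i^* / r_e(i) \tag{31.18}
\]
\[
\nu(i) = y(i) - u_i\hat{x}_{i|i-1} \tag{31.19}
\]
\[
\hat{x}_{i+1|i} = \lambda^{-1/2}\hat{x}_{i|i-1} + k_{p,i}\nu(i) \tag{31.20}
\]
\[
P_{i+1|i} = \lambda^{-1}\left[ P_{i|i-1} - \frac{P_{i|i-1}u_i^* u_i P_{i|i-1}}{1 + u_i P_{i|i-1} u_i^*} \right] \tag{31.21}
\]

Then at each iteration $i$, it will hold that

\[
\hat{x}_{0|i} = \lambda^{(i+1)/2}\hat{x}_{i+1|i}
\]

where $\hat{x}_{0|i}$ is the solution of (31.13) using data up to time $i$, i.e., $\hat{x}_{0|i} = w_i$ where $w_i$ solves

\[
\min_w \left[ \lambda^{(i+1)} w^*\Pi w + \sum_{j=0}^{i} \lambda^{i-j}|d(j) - u_j w|^2 \right]
\]

Lemma 31.1 (Equivalent deterministic problem) Consider a set of $(N+1)$
deterministic data $\{y(i), u_i\}_{i=0}^{N}$, where the $y(i)$ are scalars and the $u_i$ are
row vectors. Consider further $n \times 1$ vectors $x_i$ that satisfy $x_{i+1} = \lambda^{-1/2}x_i$,
for a positive real scalar $0 \ll \lambda \le 1$, and let $\Pi_0$ be a positive-definite matrix.
Then the solution of the regularized least-squares problem:

\[
\min_{x_0} \left[ x_0^*\Pi_0^{-1}x_0 + \sum_{j=0}^{N} |y(j) - u_j x_j|^2 \right]
\]

is equal to $\lambda^{(N+1)/2}\hat{x}_{N+1|N}$, where $\hat{x}_{N+1|N}$ is recursively computed as follows.
Start with $P_{0|-1} = \Pi_0$, $\hat{x}_{0|-1} = 0$ and repeat for $i \ge 0$:

\[
\begin{aligned}
r_e(i) &= 1 + u_i P_{i|i-1} u_i^* \\
k_{p,i} &= \lambda^{-1/2} P_{i|i-1} u_i^* / r_e(i) \\
\nu(i) &= y(i) - u_i\hat{x}_{i|i-1} \\
\hat{x}_{i+1|i} &= \lambda^{-1/2}\hat{x}_{i|i-1} + k_{p,i}\nu(i) \\
P_{i+1|i} &= \lambda^{-1}\left[ P_{i|i-1} - \frac{P_{i|i-1}u_i^* u_i P_{i|i-1}}{1 + u_i P_{i|i-1} u_i^*} \right]
\end{aligned}
\]

Moreover, at each iteration $i$, it holds that $\hat{x}_{0|i} = \lambda^{(i+1)/2}\hat{x}_{i+1|i}$ where $\hat{x}_{0|i}$
is the solution of

\[
\min_{x_0} \left[ x_0^*\Pi_0^{-1}x_0 + \sum_{j=0}^{i} |y(j) - u_j x_j|^2 \right]
\]

using data $\{y(j), u_j\}$ up to time $i$.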


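In order to verify that the recursions (31.17)-(31.21) indeed agree with the exponentially-weighted
RLS filter of Alg. 30.2, we simply need to rewrite them in terms of the RLS
variables $\{d(i), w_i\}$ as opposed to the Kalman variables $\{y(i), \hat{x}_{i|i-1}\}$. Thus, using the
substitutions

\[
y(i) = \frac{d(i)}{(\sqrt{\lambda})^i}, \qquad \hat{x}_{i+1|i} = \frac{w_i}{(\sqrt{\lambda})^{i+1}} \tag{31.22}
\]

in (31.20) leads to

\[
\frac{w_i}{(\sqrt{\lambda})^{i+1}} = \lambda^{-1/2}\frac{w_{i-1}}{(\sqrt{\lambda})^i} + k_{p,i}\left[ \frac{d(i)}{(\sqrt{\lambda})^i} - \frac{u_i w_{i-1}}{(\sqrt{\lambda})^i} \right]
\]

or, equivalently, by multiplying both sides by $(\sqrt{\lambda})^{i+1}$,

\[
w_i = w_{i-1} + \lambda^{1/2}k_{p,i}[d(i) - u_i w_{i-1}]
\]

or, by using the expression for $k_{p,i}$,

\[
w_i = w_{i-1} + \frac{P_{i|i-1}u_i^*}{1 + u_i P_{i|i-1} u_i^*}[d(i) - u_i w_{i-1}] \tag{31.23}
\]

Moreover, we already know from (30.19) that the RLS variable $P_i$ satisfies the recursion

\[
P_i^{-1} = \lambda P_{i-1}^{-1} + u_i^* u_i, \qquad P_{-1}^{-1} = \Pi
\]

while by applying the matrix inversion formula (5.4) to recursion (31.21) for $P_{i+1|i}$ we
find that

\[
P_{i+1|i}^{-1} = \lambda\left[ P_{i|i-1}^{-1} + u_i^* u_i \right], \qquad P_{0|-1}^{-1} = \lambda\Pi
\]

Comparing the recursions for the RLS and Kalman variables $\{P_i, P_{i+1|i}\}$ we conclude
that they are related via

\[
P_{i|i-1} = \lambda^{-1}P_{i-1} \tag{31.24}
\]

so that (31.23) ends up agreeing with the desired RLS equation from Alg. 30.2, namely

\[
w_i = w_{i-1} + \frac{\lambda^{-1}P_{i-1}u_i^*}{1 + \lambda^{-1}u_i P_{i-1} u_i^*}[d(i) - u_i w_{i-1}]
\]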

In summary, the above argument shows that the exponentially-weighted RLS solution of
Alg. 30.2 is equivalent to the linear least-mean-squares problem of estimating the variable
$\mathbf{x}_0$ from $\{\mathbf{y}(0), \mathbf{y}(1), \ldots, \mathbf{y}(N)\}$ in the model (31.8) once the Kalman variables are
translated into the RLS variables by using the relations (31.16), (31.20) and (31.24).
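
The correspondence just summarized can be checked numerically. The sketch below, assuming real-valued data and NumPy, runs the Kalman recursions (31.17)-(31.21) on the scaled measurements $y(i) = d(i)/(\sqrt{\lambda})^i$ next to the RLS recursions of Alg. 30.2, and tests the relations $w_i = \lambda^{(i+1)/2}\hat{x}_{i+1|i}$, $P_{i+1|i} = \lambda^{-1}P_i$, and $\gamma(i) = 1/r_e(i)$ at every step. The function names and the test data are illustrative assumptions.

```python
import numpy as np

def rls(U, d, lam, Pi):
    """Alg. 30.2 on real data; returns (w_i, P_i, gamma(i)) for each i."""
    w, P, hist = np.zeros(U.shape[1]), np.linalg.inv(Pi), []
    for u, d_i in zip(U, d):
        gamma = 1.0 / (1.0 + (u @ P @ u) / lam)
        g = (P @ u) * (gamma / lam)
        w = w + g * (d_i - u @ w)
        P = P / lam - np.outer(g, g) / gamma
        hist.append((w.copy(), P.copy(), gamma))
    return hist

def kalman_special_model(U, d, lam, Pi):
    """Recursions (31.17)-(31.21); returns (x_{i+1|i}, P_{i+1|i}, r_e(i)) for each i."""
    x = np.zeros(U.shape[1])                   # x_{0|-1} = 0
    P = np.linalg.inv(Pi) / lam                # P_{0|-1} = lam^{-1} Pi^{-1}
    hist = []
    for i, (u, d_i) in enumerate(zip(U, d)):
        y = d_i / lam ** (i / 2)               # scaled measurement, cf. (31.16)
        re = 1.0 + u @ P @ u
        kp = (P @ u) / (np.sqrt(lam) * re)
        nu = y - u @ x                         # innovations
        x = x / np.sqrt(lam) + kp * nu
        P = (P - np.outer(P @ u, u @ P) / re) / lam
        hist.append((x.copy(), P.copy(), re))
    return hist

# Illustrative data (assumed, not from the text).
rng = np.random.default_rng(4)
lam = 0.95
U = rng.standard_normal((30, 3))
d = rng.standard_normal(30)
Pi = 0.2 * np.eye(3)

rls_hist = rls(U, d, lam, Pi)
kal_hist = kalman_special_model(U, d, lam, Pi)
relations_hold = all(
    np.allclose(w, lam ** ((i + 1) / 2) * x)
    and np.allclose(P_pred, P / lam)
    and np.isclose(gamma, 1.0 / re)
    for i, ((w, P, gamma), (x, P_pred, re)) in enumerate(zip(rls_hist, kal_hist))
)
```

Note that the scaled measurements grow like $\lambda^{-i/2}$, so for long records and $\lambda < 1$ this Kalman form is useful mainly as an analysis device rather than as a numerical recipe.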

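Actually, there are additional identifications that we can make between the Kalman variables
and the RLS variables. Recall, for instance, that with the RLS problem we associate
two residuals at each time instant $i$: the a priori error

\[
e(i) = d(i) - u_i w_{i-1}
\]

and the a posteriori error

\[
r(i) = d(i) - u_i w_i
\]

These residuals can be related to the innovations variable $\nu(i)$ as follows. First note that

\[
\nu(i) = y(i) - u_i\hat{x}_{i|i-1} = \frac{1}{(\sqrt{\lambda})^i}[d(i) - u_i w_{i-1}] = \frac{1}{(\sqrt{\lambda})^i}e(i)
\]

while

\[
\begin{aligned}
r(i) &= d(i) - u_i w_i = d(i) - (\sqrt{\lambda})^{i+1}u_i\hat{x}_{i+1|i} \\
     &= d(i) - (\sqrt{\lambda})^{i+1}u_i\left[ \lambda^{-1/2}\hat{x}_{i|i-1} + k_{p,i}\nu(i) \right] \\
     &= [d(i) - u_i w_{i-1}] - \sqrt{\lambda}\,u_i k_{p,i}e(i) \\
     &= \left[ 1 - \sqrt{\lambda}\,u_i k_{p,i} \right] e(i) \\
     &= \left[ 1 - \frac{u_i P_{i|i-1}u_i^*}{r_e(i)} \right] e(i) = e(i)/r_e(i)
\end{aligned}
\]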
This means that the conversion factor, which converts the RLS a priori error $e(i)$ to the
RLS a posteriori error $r(i)$, and which we have denoted by $\gamma(i)$ in Alg. 30.2, is equal to
$r_e^{-1}(i)$ (the inverse of the innovations variance). Table 31.2 summarizes the correspondences
between the Kalman variables and the RLS variables. This table is useful for the
following purpose. By writing down any of the available algorithmic variants for Kalman
filtering for model (31.8) (e.g., cf. Apps. 35.A and 37.A), and by using the correspondences
from Table 31.2, we can obtain the corresponding RLS variant. Actually, this point of view
can be used to derive all of the RLS variants described in this book (see Probs. VIII.12
and VIII.13, as well as Prob. IX.16).

TABLE 31.2  Correspondences between Kalman and RLS variables. In the Kalman case,
realizations of the random variables are listed (using normal font) rather than the random variables
themselves (which would have been described in boldface font).


Description            Kalman variable       RLS variable                         Description
Measurement            $y(i)$                $d(i)/(\sqrt{\lambda})^i$            Reference signal
State vector           $x_i$                 $w/(\sqrt{\lambda})^i$               Weight vector
State estimator        $\hat{x}_{i+1|i}$     $w_i/(\sqrt{\lambda})^{i+1}$         Weight estimate
Error covariance       $P_{i+1|i}$           $\lambda^{-1}P_i$                    Inverse of coefficient matrix
Gain vector            $k_{p,i}$             $\lambda^{-1/2}g_i$                  Gain vector
Innovations            $\nu(i)$              $e(i)/(\sqrt{\lambda})^i$            A priori error
                                             $r(i)r_e(i)/(\sqrt{\lambda})^i$      A posteriori error
Innovations variance   $r_e(i)$              $\gamma^{-1}(i)$                     Inverse of conversion factor
Initial condition      $\hat{x}_{0|-1}$      $w_{-1}$                             Initial condition
Initial covariance     $P_{0|-1}$            $\lambda^{-1}\Pi^{-1}$               Regularization matrix


31.A APPENDIX: EXTENDED RLS ALGORITHMS 


The arguments in Sec. 31.2 show that the exponentially-weighted RLS solution of Alg. 30.2 is equivalent
to the linear least-mean-squares problem of estimating $\mathbf{x}_0$ from the observations of model (31.8).
Now model (31.8) is a special one and, therefore, the RLS filter is equivalent not to a full-blown 
Kalman filter, but only to a special case of it. 

In this section we describe the general deterministic criterion that is equivalent to a full-blown 
Kalman filter. In so doing, we arrive at extended RLS algorithms that are better suited for tracking 
the state-vector of general linear state-space models, as opposed to tracking the state-vector of the 
special model (31.8). 


Deterministic Estimation 
Consider a collection of $(N+1)$ measurements $\{d_i\}$, possibly column vectors, that satisfy

\[
d_i = U_i x_i + v_i \tag{31.25}
\]

where the $\{x_i\}$ evolve in time according to the state recursion

\[
x_{i+1} = F_i x_i + G_i n_i \tag{31.26}
\]

Here the $\{F_i, G_i, U_i\}$ are known matrices and the $\{n_i, v_i\}$ denote disturbances or noises. For generality,
we are allowing the observations $\{d_i\}$ to be vectors. Let further $\Pi_0$ be a positive-definite
regularization matrix, and let $\{Q_i, R_i\}$ be positive-definite weighting matrices. Given the $\{d_i\}$, we
pose the problem of estimating the initial state vector $x_0$ and the signals $\{n_0, n_1, \ldots, n_N\}$ in a
regularized least-squares manner by solving

\[
\min_{\{x_0, n_0, \ldots, n_N\}} \left[ x_0^*\Pi_0^{-1}x_0 + \sum_{i=0}^{N} (d_i - U_i x_i)^* R_i^{-1}(d_i - U_i x_i) + \sum_{i=0}^{N} n_i^* Q_i^{-1} n_i \right] \tag{31.27}
\]

subject to the constraint (31.26). We denote the solution by $\{\hat{x}_{0|N}, \hat{n}_{j|N},\ 0 \le j \le N\}$, and we refer
to them as smoothed estimates since they are based on observations beyond the times of occurrence
of the respective variables $\{x_0, n_j\}$.


In principle, we could solve (31.27) by using variational (optimization) arguments, e.g., by using 
a Lagrange multiplier argument. Instead, we shall solve it by appealing to the equivalence result of 
Table 31.1. In other words, we shall first determine the equivalent stochastic problem and then solve 
this latter problem to arrive at the solution of (31.27). This method of solving (31.27) not only serves 
as an illustration of the convenience of equivalence results in estimation theory, but it also shows that 
sometimes it is easier to solve a deterministic problem in the stochastic domain (or vice-versa). In 
our case, the problem at hand is more conveniently solved in the stochastic domain. 

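Define the column vectors

\[
z = \mathrm{col}\{x_0, n_0, n_1, \ldots, n_N\}
\]

and

\[
y = \mathrm{col}\{d_0, d_1, \ldots, d_N\}
\]

as well as the block-diagonal matrices

\[
W^{-1} \triangleq (R_0 \oplus R_1 \oplus \cdots \oplus R_N), \qquad \Pi^{-1} \triangleq (\Pi_0 \oplus Q_0 \oplus \cdots \oplus Q_N) \tag{31.28}
\]

Then the term

\[
x_0^*\Pi_0^{-1}x_0 + \sum_{i=0}^{N} n_i^* Q_i^{-1} n_i
\]

that appears in (31.27) is equal to $z^*\Pi z$. Moreover, by using the state equation (31.26) to express
each term $U_i x_i$ in terms of combinations of the entries of $z$, we can verify that

\[
\sum_{i=0}^{N} (d_i - U_i x_i)^* R_i^{-1}(d_i - U_i x_i) = (y - Hz)^* W (y - Hz) = \|y - Hz\|_W^2
\]

where the matrix $H$ is block lower-triangular and given by

\[
H \triangleq \begin{bmatrix}
U_0 & & & & & \\
U_1\Phi(1,0) & U_1 G_0 & & & & \\
U_2\Phi(2,0) & U_2\Phi(2,1)G_0 & U_2 G_1 & & & \\
\vdots & \vdots & \vdots & \ddots & & \\
U_N\Phi(N,0) & U_N\Phi(N,1)G_0 & U_N\Phi(N,2)G_1 & \cdots & U_N G_{N-1} & 0
\end{bmatrix} \tag{31.29}
\]

and the matrices $\Phi(i,j)$ are defined by

\[
\Phi(i,j) \triangleq \begin{cases} F_{i-1}F_{i-2}\cdots F_j, & i > j \\ I, & i = j \end{cases}
\]

In other words, we find that we can rewrite the original cost function (31.27) as the regularized
least-squares cost function

\[
\min_z \left[ z^*\Pi z + (y - Hz)^* W (y - Hz) \right] \tag{31.30}
\]

Let $\hat{z}_{|N}$ denote the solution to (31.30), i.e., $\hat{z}_{|N}$ is a column vector that contains the desired solutions
$\{\hat{x}_{0|N}, \hat{n}_{0|N}, \hat{n}_{1|N}, \ldots, \hat{n}_{N|N}\}$. Now, in view of the equivalence result of Sec. 31.1, we know that
$\hat{z}_{|N}$ can be obtained by solving an equivalent stochastic estimation problem that is determined as
follows.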

Stochastic Estimation 


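We introduce zero-mean random vectors $\{\mathbf{z}, \mathbf{y}\}$, with the same dimensions and partitioning as the
above $\{z, y\}$, and assume that they are related via a linear model of the form

\[
\mathbf{y} = H\mathbf{z} + \mathbf{v} \tag{31.31}
\]

where $H$ is the same matrix as in (31.29), and where $\mathbf{v}$ denotes a zero-mean additive noise vector,
uncorrelated with $\mathbf{z}$, and partitioned as $\mathbf{v} = \mathrm{col}\{\mathbf{v}_0, \mathbf{v}_1, \ldots, \mathbf{v}_N\}$. The dimensions of the $\{\mathbf{v}_i\}$ are
compatible with those of $\{\mathbf{d}_i\}$. We denote the covariance matrices of $\{\mathbf{z}, \mathbf{v}\}$ by

\[
R_z = \mathsf{E}\,\mathbf{z}\mathbf{z}^*, \qquad R_v = \mathsf{E}\,\mathbf{v}\mathbf{v}^* \tag{31.32}
\]

and we choose them as $R_z = \Pi^{-1}$ and $R_v = W^{-1}$, where $\{\Pi, W\}$ are given by (31.28).
Let $\hat{\mathbf{z}}_{|N}$ denote the l.l.m.s. estimator of $\mathbf{z}$ given the entries $\{\mathbf{d}_0, \mathbf{d}_1, \ldots, \mathbf{d}_N\}$ in $\mathbf{y}$. We further
partition $\mathbf{z}$ as

\[
\mathbf{z} = \mathrm{col}\{\mathbf{x}_0, \mathbf{n}_0, \mathbf{n}_1, \ldots, \mathbf{n}_N\}
\]

Then the equivalence result of Table 31.1 states that the expression for $\hat{\mathbf{z}}_{|N}$ in terms of $\mathbf{y}$ in the
stochastic problem (31.31) is identical to the expression for $\hat{z}_{|N}$ in terms of $y$ in the deterministic
problem (31.30).

In order to determine $\hat{\mathbf{z}}_{|N}$ or, equivalently, $\{\hat{\mathbf{x}}_{0|N}, \hat{\mathbf{n}}_{j|N}\}$, we start by noting that the linear model
(31.31), coupled with the definitions of $\{R_z, R_v, H\}$ in (31.28), (31.29), and (31.32), shows that the
stochastic variables $\{\mathbf{d}_i, \mathbf{v}_i, \mathbf{x}_0, \mathbf{n}_i\}$ so defined satisfy the following state-space model:

\[
\mathbf{x}_{i+1} = F_i\mathbf{x}_i + G_i\mathbf{n}_i, \qquad \mathbf{d}_i = U_i\mathbf{x}_i + \mathbf{v}_i \tag{31.33}
\]

with

\[
\mathsf{E}\,\mathbf{n}_i\mathbf{n}_j^* = Q_i\delta_{ij}, \quad \mathsf{E}\,\mathbf{v}_i\mathbf{v}_j^* = R_i\delta_{ij}, \quad \mathsf{E}\,\mathbf{x}_0\mathbf{x}_0^* = \Pi_0, \quad \mathsf{E}\,\mathbf{n}_i\mathbf{v}_j^* = 0, \quad \mathsf{E}\,\mathbf{x}_0\mathbf{n}_i^* = 0, \quad \mathsf{E}\,\mathbf{x}_0\mathbf{v}_i^* = 0 \tag{31.34}
\]

We now use this model to derive recursions for estimating $\mathbf{z}$ (i.e., for estimating the variables
$\{\mathbf{x}_0, \mathbf{n}_0, \mathbf{n}_1, \ldots, \mathbf{n}_N\}$).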


Solving the Stochastic Problem 


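Let $\hat{\mathbf{z}}_{|i}$ denote the linear least-mean-squares (l.l.m.s.) estimator of $\mathbf{z}$ given the top entries $\{\mathbf{d}_0, \ldots, \mathbf{d}_i\}$
in $\mathbf{y}$. To determine $\hat{\mathbf{z}}_{|i}$, and ultimately $\hat{\mathbf{z}}_{|N}$, we can proceed recursively by employing the innovations
$\{\mathbf{e}_i\}$ of the observations $\{\mathbf{d}_i\}$. In this appendix, we stick to our standard notation for the innovations
and use $\mathbf{e}_i$ instead of $\nu_i$, which was used in Sec. 31.2. There is no need here to avoid
confusion with the error signal for RLS. Using the basic recursive estimation formula (7.6) we have

\[
\begin{aligned}
\hat{\mathbf{z}}_{|i} &= \hat{\mathbf{z}}_{|i-1} + (\mathsf{E}\,\mathbf{z}\mathbf{e}_i^*) R_{e,i}^{-1}\mathbf{e}_i \\
&= \hat{\mathbf{z}}_{|i-1} + (\mathsf{E}\,\mathbf{z}\tilde{\mathbf{x}}_{i|i-1}^*) U_i^* R_{e,i}^{-1}\mathbf{e}_i, \qquad \hat{\mathbf{z}}_{|-1} = 0
\end{aligned}
\tag{31.35}
\]

where we used in the second equality the innovations equation (cf. (7.21)):

\[
\mathbf{e}_i = \mathbf{d}_i - U_i\hat{\mathbf{x}}_{i|i-1} = U_i\tilde{\mathbf{x}}_{i|i-1} + \mathbf{v}_i
\]

and the fact that $\mathsf{E}\,\mathbf{x}_0\mathbf{v}_i^* = 0$ and $\mathsf{E}\,\mathbf{n}_j\mathbf{v}_i^* = 0$ for all $j$. Clearly, the entries of $\hat{\mathbf{z}}_{|i}$ have the interpretation

\[
\hat{\mathbf{z}}_{|i} = \mathrm{col}\{\hat{\mathbf{x}}_{0|i}, \hat{\mathbf{n}}_{0|i}, \hat{\mathbf{n}}_{1|i}, \ldots, \hat{\mathbf{n}}_{i-1|i}, 0, 0, \ldots, 0\}
\]

where the trailing entries of $\hat{\mathbf{z}}_{|i}$ are zero since $\hat{\mathbf{n}}_{j|i} = 0$ for $j \ge i$.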
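Let $K_{z,i} \triangleq \mathsf{E}\,\mathbf{z}\tilde{\mathbf{x}}_{i|i-1}^*$. The above recursive construction would be complete, and hence provide
the desired quantity $\hat{\mathbf{z}}_{|N}$, once we show how to evaluate the gain matrix $K_{z,i}$. For this purpose, we
first subtract the equations (from the Kalman filter Alg. 7.1):

\[
\mathbf{x}_{i+1} = F_i\mathbf{x}_i + G_i\mathbf{n}_i \qquad \text{and} \qquad \hat{\mathbf{x}}_{i+1|i} = F_i\hat{\mathbf{x}}_{i|i-1} + K_{p,i}[U_i\tilde{\mathbf{x}}_{i|i-1} + \mathbf{v}_i]
\]

to obtain

\[
\tilde{\mathbf{x}}_{i+1|i} = F_{p,i}\tilde{\mathbf{x}}_{i|i-1} + G_i\mathbf{n}_i - K_{p,i}\mathbf{v}_i
\]

where $F_{p,i} = F_i - K_{p,i}U_i$. Using this recursion, it is easy to verify that $K_{z,i}$ satisfies the recursion:

\[
K_{z,i+1} = \mathsf{E}\,\mathbf{z}\tilde{\mathbf{x}}_{i+1|i}^* = K_{z,i}F_{p,i}^* + \begin{bmatrix} 0 \\ I \\ 0 \end{bmatrix} Q_i G_i^*, \qquad K_{z,0} = \begin{bmatrix} \Pi_0 \\ 0 \end{bmatrix} \tag{31.36}
\]

The identity matrix that appears in the second term of the recursion for $K_{z,i+1}$ occurs at the position
that corresponds to the entry $\mathbf{n}_i$ in the vector $\mathbf{z}$, e.g.,

\[
K_{z,1} = \begin{bmatrix} \Pi_0 F_{p,0}^* \\ Q_0 G_0^* \\ 0 \end{bmatrix}, \qquad
K_{z,2} = \begin{bmatrix} \Pi_0 F_{p,0}^* F_{p,1}^* \\ Q_0 G_0^* F_{p,1}^* \\ Q_1 G_1^* \\ 0 \end{bmatrix}
\]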

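Substituting (31.36) into (31.35) we find that the following recursions hold:

\[
\begin{aligned}
\hat{\mathbf{x}}_{0|i} &= \hat{\mathbf{x}}_{0|i-1} + \Pi_0\Phi_p^*(i,0)U_i^* R_{e,i}^{-1}\mathbf{e}_i, \qquad \hat{\mathbf{x}}_{0|-1} = 0 \\
\hat{\mathbf{n}}_{j|i} &= \hat{\mathbf{n}}_{j|i-1} + Q_j G_j^*\Phi_p^*(i,j+1)U_i^* R_{e,i}^{-1}\mathbf{e}_i, \qquad j < i \\
\hat{\mathbf{n}}_{j|i} &= 0, \qquad j \ge i
\end{aligned}
\tag{31.37}
\]

where the matrix $\Phi_p(i,j)$ is defined by

\[
\Phi_p(i,j) \triangleq \begin{cases} F_{p,i-1}F_{p,i-2}\cdots F_{p,j}, & i > j \\ I, & i = j \end{cases}
\]

If we introduce the auxiliary variable

\[
\mathbf{p}_{i|N} \triangleq \sum_{j=i}^{N} \Phi_p^*(j,i)U_j^* R_{e,j}^{-1}\mathbf{e}_j
\]

then it is easy to verify that recursions (31.37) lead to

\[
\begin{aligned}
\hat{\mathbf{x}}_{0|N} &= \Pi_0\mathbf{p}_{0|N} \\
\hat{\mathbf{x}}_{j+1|j} &= F_{p,j}\hat{\mathbf{x}}_{j|j-1} + K_{p,j}\mathbf{d}_j, \qquad \hat{\mathbf{x}}_{0|-1} = 0 \\
\mathbf{e}_j &= \mathbf{d}_j - U_j\hat{\mathbf{x}}_{j|j-1} \\
\hat{\mathbf{n}}_{j|N} &= Q_j G_j^*\mathbf{p}_{j+1|N} \\
\mathbf{p}_{j|N} &= F_{p,j}^*\mathbf{p}_{j+1|N} + U_j^* R_{e,j}^{-1}\mathbf{e}_j, \qquad \mathbf{p}_{N+1|N} = 0
\end{aligned}
\tag{31.38}
\]

These equations are known as the Bryson-Frazier recursions in the literature of Kalman filtering;
the recursions (31.37) evaluate the estimators $\{\hat{\mathbf{x}}_{0|i}, \hat{\mathbf{n}}_{j|i}\}$ for successive values of $i$, and not only
for $i = N$ as in (31.38). Just like $\{\hat{\mathbf{x}}_{0|N}, \hat{\mathbf{n}}_{j|N}\}$, the estimators $\{\hat{\mathbf{x}}_{0|i}, \hat{\mathbf{n}}_{j|i}\}$ can also be related
to the solution of a least-squares problem. Indeed, by equivalence, the expressions that provide the
solutions $\{\hat{\mathbf{x}}_{0|i}, \hat{\mathbf{n}}_{j|i}\}$ in (31.37) should coincide with those that provide the solutions $\{\hat{x}_{0|i}, \hat{n}_{j|i}\}$
for the following deterministic problem, with data up to time $i$ (rather than $N$ as in (31.27)):

\[
\min_{\{x_0, n_0, \ldots, n_i\}} \left[ x_0^*\Pi_0^{-1}x_0 + \sum_{j=0}^{i} (d_j - U_j x_j)^* R_j^{-1}(d_j - U_j x_j) + \sum_{j=0}^{i} n_j^* Q_j^{-1} n_j \right] \tag{31.39}
\]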

Solving the Deterministic Problem 


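We know by equivalence that the mappings from $\{\mathbf{d}_i\}$ to $\{\hat{\mathbf{x}}_{0|N}, \hat{\mathbf{n}}_{j|N}\}$ in the stochastic problem
(31.31) coincide with the mappings from $\{d_i\}$ to $\{\hat{x}_{0|N}, \hat{n}_{j|N}\}$ in the deterministic problem (31.30).
We are therefore led to the following statement.

Algorithm 31.1 (Extended RLS) Consider measurements $\{d_i\}$ that satisfy a state-space
model of the form

\[
x_{i+1} = F_i x_i + G_i n_i, \qquad d_i = U_i x_i + v_i
\]

where the $\{F_i, G_i, U_i\}$ are known matrices. The solution $\{\hat{x}_{0|N}, \hat{n}_{j|N}\}$ of

\[
\min_{\{x_0, n_0, \ldots, n_N\}} \left[ x_0^*\Pi_0^{-1}x_0 + \sum_{i=0}^{N} (d_i - U_i x_i)^* R_i^{-1}(d_i - U_i x_i) + \sum_{i=0}^{N} n_i^* Q_i^{-1} n_i \right]
\]

can be determined recursively as follows. Start with $\hat{x}_{0|-1} = 0$, $P_{0|-1} = \Pi_0$, and
$p_{N+1|N} = 0$, and run the following equations forward in time from $i = 0$ to $i = N$:

\[
\begin{aligned}
R_{e,i} &= R_i + U_i P_{i|i-1} U_i^* \\
K_{p,i} &= F_i P_{i|i-1} U_i^* R_{e,i}^{-1} \\
e_i &= d_i - U_i\hat{x}_{i|i-1} \\
\hat{x}_{i+1|i} &= F_i\hat{x}_{i|i-1} + K_{p,i}e_i \\
P_{i+1|i} &= F_i P_{i|i-1} F_i^* + G_i Q_i G_i^* - K_{p,i}R_{e,i}K_{p,i}^*
\end{aligned}
\]

Then run the following recursion backward in time from $i = N$ down to $i = 0$:

\[
p_{i|N} = F_{p,i}^* p_{i+1|N} + U_i^* R_{e,i}^{-1} e_i
\]

and set

\[
\hat{x}_{0|N} = \Pi_0 p_{0|N}
\]

and

\[
\hat{n}_{i|N} = Q_i G_i^* p_{i+1|N} \quad \text{for } 0 \le i \le N
\]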
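
The forward and backward passes of Alg. 31.1 can be sketched and compared with a direct solution of the batch problem (31.30). The sketch below assumes real-valued data, NumPy, and, purely for brevity, time-invariant $\{F, G, R, Q\}$ with time-varying $U_i$; the function names `extended_rls` and `batch_solution` and all numerical values are hypothetical.

```python
import numpy as np

def extended_rls(d, U, F, G, R, Q, Pi0):
    """Alg. 31.1 for real data with constant F, G, R, Q.
    d has shape (N+1, p) and U has shape (N+1, p, n)."""
    n_steps, n = len(d), F.shape[0]
    x, P = np.zeros(n), Pi0.copy()             # x_{0|-1} = 0, P_{0|-1} = Pi_0
    Fp_list, drive = [], []
    for i in range(n_steps):                   # forward pass
        Re = R + U[i] @ P @ U[i].T
        Kp = F @ P @ U[i].T @ np.linalg.inv(Re)
        e = d[i] - U[i] @ x
        Fp_list.append(F - Kp @ U[i])          # F_{p,i} = F_i - K_{p,i} U_i
        drive.append(U[i].T @ np.linalg.solve(Re, e))
        x = F @ x + Kp @ e
        P = F @ P @ F.T + G @ Q @ G.T - Kp @ Re @ Kp.T
    p = np.zeros(n)                            # p_{N+1|N} = 0
    n_hat = np.zeros((n_steps, G.shape[1]))
    for i in reversed(range(n_steps)):         # backward pass
        n_hat[i] = Q @ G.T @ p                 # uses p_{i+1|N}
        p = Fp_list[i].T @ p + drive[i]        # now holds p_{i|N}
    return Pi0 @ p, n_hat

def batch_solution(d, U, F, G, R, Q, Pi0):
    """Build H as in (31.29) and solve the regularized problem (31.30) directly."""
    n_steps, p_dim = d.shape
    n, m = G.shape
    H = np.zeros((n_steps * p_dim, n + n_steps * m))
    for i in range(n_steps):
        rows = slice(i * p_dim, (i + 1) * p_dim)
        H[rows, :n] = U[i] @ np.linalg.matrix_power(F, i)
        for k in range(i):                     # column block of n_k
            cols = slice(n + k * m, n + (k + 1) * m)
            H[rows, cols] = U[i] @ np.linalg.matrix_power(F, i - 1 - k) @ G
    W = np.kron(np.eye(n_steps), np.linalg.inv(R))
    Pi = np.zeros((n + n_steps * m, n + n_steps * m))
    Pi[:n, :n] = np.linalg.inv(Pi0)
    Pi[n:, n:] = np.kron(np.eye(n_steps), np.linalg.inv(Q))
    z = np.linalg.solve(Pi + H.T @ W @ H, H.T @ W @ d.reshape(-1))
    return z[:n], z[n:].reshape(n_steps, m)

# Illustrative model and data (assumed, not from the text).
rng = np.random.default_rng(3)
F = np.array([[0.9, 0.1], [0.0, 0.8]])
G = np.array([[0.0], [1.0]])
R, Q, Pi0 = np.array([[0.5]]), np.array([[0.2]]), np.eye(2)
U = rng.standard_normal((8, 1, 2))
d = rng.standard_normal((8, 1))

x0_rec, n_rec = extended_rls(d, U, F, G, R, Q, Pi0)
x0_bat, n_bat = batch_solution(d, U, F, G, R, Q, Pi0)
```

The batch route requires solving a system whose size grows with $N$, whereas the recursive route works with matrices of the state dimension only, which is the practical appeal of the two-pass form.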


Special Cases 


Algorithm 31.1 is a generalized RLS algorithm and has superior tracking abilities. The generalization
is in several respects:


1. To begin with, the measurements $\{d_i\}$ are allowed to be vector-valued. In most of our treatments
so far, we have assumed scalar-valued observations arising from a linear model of the
form

\[
d(i) = u_i w^o + v(i)
\]

for some unknown weight vector $w^o$ to be estimated.


2. Block RLS. Algorithm 31.1 assumes that the state vectors $\{x_i\}$ that determine the measurements
$\{d_i\}$ evolve in time according to the state-space recursion

\[
x_{i+1} = F_i x_i + G_i n_i
\]

This assumption is more general than assuming $d_i = U_i w^o + v_i$ with an unknown constant
weight vector $w^o$. This latter situation can be modeled as $x_{i+1} = x_i$ with $x_0 = w^o$. That is, it
corresponds to the special choices $F_i = I$, $G_i = 0$ and $x_0 = w^o$. In this case, the recursions
of Alg. 31.1 reduce to the following. Start with $\hat{x}_{0|-1} = 0$ and $P_{0|-1} = \Pi_0$ and repeat for
$i \ge 0$:

\[
\begin{aligned}
R_{e,i} &= R_i + U_i P_{i|i-1} U_i^* \\
K_{p,i} &= P_{i|i-1} U_i^* R_{e,i}^{-1} \\
e_i &= d_i - U_i\hat{x}_{i|i-1} \\
\hat{x}_{i+1|i} &= \hat{x}_{i|i-1} + K_{p,i}e_i \\
P_{i+1|i} &= P_{i|i-1} - K_{p,i}R_{e,i}K_{p,i}^*
\end{aligned}
\]

Now since $x_i = x_0$, the quantities $\hat{x}_{i|i-1}$ correspond to $\hat{x}_{0|i-1}$; i.e., they provide recursive
estimates for $x_0$ and there is no need for the $p_{i|N}$ recursion. If we denote $\hat{x}_{0|i}$ by $w_i$ and $P_{i+1|i}$
by $P_i$, then the above equations will agree with the block RLS algorithm of Prob. VII.36.


3. RLS. Clearly, Alg. 31.1 also subsumes the exponentially-weighted RLS filter of Alg. 30.2 as
a special case. Comparing (31.14) with (31.39), we see that (31.14) is a special case of the
latter if we select

\[
R_j = \lambda^j, \qquad \Pi_0^{-1} = \lambda\Pi \tag{31.40}
\]

for some scalar $0 \ll \lambda \le 1$ and positive-definite regularization matrix $\Pi$. Moreover, we should
also select $d_i$ to be a scalar, say, $d(i)$, and $U_i$ to be a row vector, say, $u_i$, and assume that they
are related via

\[
x_{i+1} = x_i, \qquad d(i) = u_i x_i + v(i) \tag{31.41}
\]

This model corresponds to the special choices $F_i = I$ and $G_i = 0$. Then the recursions of
Alg. 31.1 would reduce to the following. Start with $\hat{x}_{0|-1} = 0$ and $P_{0|-1} = \lambda^{-1}\Pi^{-1}$ and
repeat for $i \ge 0$:

\[
\begin{aligned}
r_e(i) &= \lambda^i + u_i P_{i|i-1} u_i^* \\
k_{p,i} &= P_{i|i-1} u_i^* / r_e(i) \\
e(i) &= d(i) - u_i\hat{x}_{i|i-1} \\
\hat{x}_{i+1|i} &= \hat{x}_{i|i-1} + k_{p,i}e(i) \\
P_{i+1|i} &= P_{i|i-1} - \frac{P_{i|i-1}u_i^* u_i P_{i|i-1}}{\lambda^i + u_i P_{i|i-1} u_i^*}
\end{aligned}
\]

In this case again, since $x_i = x_0$, the quantities $\hat{x}_{i|i-1}$ correspond to $\hat{x}_{0|i-1}$; i.e., they provide
recursive estimates for $x_0$ and there is no need for the $p_{i|N}$ recursion. If we denote $\hat{x}_{0|i}$ by $w_i$
and use the easily verifiable fact that now $P_{i+1|i} = \lambda^i P_i$, then the above equations will agree
with the exponentially-weighted RLS filter of Alg. 30.2.


4. Other state-space models. In the equivalence argument of Sec. 31.2 we relied on
a different state-space model than (31.41) to arrive at the RLS algorithm. In particular, in
model (31.8), we used $F_i = \lambda^{-1/2}I$, $R(i) = 1$, and scaled the measurements $d(i)$ as in
(31.16). The resulting state-space model had a constant $R_i$ whereas (31.40) uses $R_i = \lambda^i$.
While in (31.24) we had $P_{i|i-1} = \lambda^{-1}P_{i-1}$, we now have $P_{i+1|i} = \lambda^i P_i$. These
constructions illustrate that there is freedom in selecting the underlying state-space model.
One key advantage of using (31.8) is that this model belongs to the class of structured state-space
models (defined later by (37.25)), for which fast estimation algorithms can be readily
derived. Choosing $R_i = \lambda^i$, on the other hand, leads to a time-variant model, as remarked in
the Bibliographic Notes at the end of this part. In this way, model (31.8) becomes useful not
only in establishing the connection between plain RLS and Kalman filtering, but also between
other RLS and Kalman filtering variants (especially between their fast variants). We shall
explain these connections in App. 37.A and Prob. IX.16.


5. Tracking. Algorithm 31.1 suggests extensions of RLS in order to track variations in the
weight vector that can be captured by linear state-space recursions. Thus, assume, for example,
that the measurements $\{d(i)\}$ satisfy

\[
d(i) = u_i w_i^o + v(i)
\]

and that the unknown weight vector $w_i^o$ evolves in time according to

\[
w_{i+1}^o = \alpha w_i^o + n_i
\]

for some scalar $|\alpha| \le 1$ and disturbance $n_i$. This model is a special case of the state-space
model used by Alg. 31.1 if we make the identification $x_i = w_i^o$ and if we set

\[
F_i = \alpha I, \qquad G_i = I
\]

Then, again, if we select $\Pi_0^{-1} = \lambda\Pi$, $R_i = \lambda^i$, and $Q_i = q\lambda^i I$, for some positive scalar $q$
and some scalar $0 \ll \lambda \le 1$ used to introduce exponential weighting,^{14} then the recursions
of Alg. 31.1 would reduce to the following. Start with $\hat{x}_{0|-1} = 0$ and $P_{0|-1} = \lambda^{-1}\Pi^{-1}$ and
repeat for $i \ge 0$:

\[
\begin{aligned}
r_e(i) &= \lambda^i + u_i P_{i|i-1} u_i^* \\
k_{p,i} &= \alpha P_{i|i-1} u_i^* / r_e(i) \\
e(i) &= d(i) - u_i\hat{x}_{i|i-1} \\
\hat{x}_{i+1|i} &= \alpha\hat{x}_{i|i-1} + k_{p,i}e(i) \\
P_{i+1|i} &= |\alpha|^2\left[ P_{i|i-1} - \frac{P_{i|i-1}u_i^* u_i P_{i|i-1}}{\lambda^i + u_i P_{i|i-1} u_i^*} \right] + \lambda^i q I
\end{aligned}
\]

In this case, since $x_i = w_i^o$, the quantity $\hat{x}_{i|i-1}$ serves as an estimate for $w_i^o$ that is based on
measurements up to time $i - 1$. If we denote $\hat{x}_{i+1|i}$ by $w_i$ and if we introduce the change of
variables

\[
P_i \triangleq \lambda^{-i}P_{i+1|i}
\]

then by multiplying the recursion for $P_{i+1|i}$ by $\lambda^{-i}$ on both sides, the above equations can be
rewritten equivalently in the following form. Start with $w_{-1} = 0$ and $P_{-1} = \Pi^{-1}$ and repeat
for $i \ge 0$:

\[
\begin{aligned}
w_i &= \alpha w_{i-1} + \frac{\lambda^{-1}\alpha P_{i-1}u_i^*}{1 + \lambda^{-1}u_i P_{i-1}u_i^*}[d(i) - u_i w_{i-1}] \\
P_i &= \lambda^{-1}|\alpha|^2\left[ P_{i-1} - \frac{\lambda^{-1}P_{i-1}u_i^* u_i P_{i-1}}{1 + \lambda^{-1}u_i P_{i-1}u_i^*} \right] + qI
\end{aligned}
\tag{31.42}
\]

These recursions collapse to the exponentially-weighted RLS filter of Alg. 30.2 when $\alpha = 1$
and $q = 0$. The associated cost function is

\[
\min_{\{w_0^o, n_0, \ldots, n_N\}} \left[ \lambda^{(N+1)} w_0^{o*}\Pi w_0^o + \sum_{i=0}^{N} \lambda^{N-i}|d(i) - u_i w_i^o|^2 + q^{-1}\sum_{i=0}^{N} \lambda^{N-i}\|n_i\|^2 \right]
\]

subject to $w_{i+1}^o = \alpha w_i^o + n_i$. Note that if we extract a factor $\lambda^N$ out, then the cost function
becomes of the same form shown in Alg. 31.1 for the choices $R_i = \lambda^i$, $Q_i = q\lambda^i I$, and
$\Pi_0^{-1} = \lambda\Pi$.


6. RLS and Kalman variants. While in future chapters we focus almost exclusively on the
exponentially-weighted RLS scheme of Alg. 30.2, and derive several variants for it, many of
these developments can be extended to the more general RLS schemes shown above. These
extensions could be pursued afresh by repeating the arguments of the subsequent chapters.
Alternatively, once the connection between Kalman filters and extended RLS filters has been
established, most of the algorithmic variants that exist for Kalman filters can be applied immediately
to extended RLS filters.


140Of course, more general choices for R; and Q; are allowed by Alg. 31.1. These special choices are for 
illustration purposes only. 
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Order and Time-Update Relations 


b chapter can be skipped on a first reading. Its results will only be needed in later 
chapters when we study fast fixed-order and order-recursive least-squares filters in Parts 
IX (Fast RLS Algorithms) and X (Lattice Filters). 

However, there is a reason why we choose to present the material at this location, and 
not in later chapters. The reason is that the results we present here are often derived in 
the literature in the absence of regularization and under the assumption of some structure 
in the data matrix H (such as requiring its successive rows to be shifted versions of one 
another, as explained later in the introductory remarks to Chapter 37). In comparison, the 
derivation given here indicates that the results hold irrespective of data structure, i.e., for 
any H. In addition, the arguments incorporate regularization and clarify its role in order- 
update relations. In so doing, the results will allow us to provide later in Chapter 40 a 
treatment of lattice filters in the presence of regularization. The results also allow the class 
of lattice filters to be extended to more general data structures, other than the classical 
tapped-delay-line structure, as was shown in detail in Chapter 16 of Sayed (2003). 

For now, it suffices to treat the material in this chapter as an application of the concepts 
and geometric constructions of the earlier sections. 


32.1 BACKWARD ORDER-UPDATE RELATIONS 


Consider a weighted regularized least-squares problem of the form 

$$\min_{w} \left[ w^*\Pi w + (y - Hw)^* W (y - Hw) \right] \qquad (32.1)$$

whose optimal solution is given by (cf. Thm. 29.5): 

$$\hat{w} = (\Pi + H^*WH)^{-1} H^*Wy \qquad (32.2)$$

Let 

$$P \triangleq (\Pi + H^*WH)^{-1} \qquad (32.3)$$

In other words, $P$ is the inverse of the coefficient matrix that appears in the normal equations 

$$(\Pi + H^*WH)\,\hat{w} = H^*Wy$$

Then 

$$\hat{w} = PH^*Wy$$

and the estimate of $y$ is 

$$\hat{y} = H\hat{w} = HPH^*Wy \qquad (32.4)$$

We refer to $\hat{y}$ as the regularized projection (or simply projection) of $y$ onto the range space 
of $H$. Recall from Sec. 29.5 that, when $H$ has full column rank, the actual projection 
matrix onto $\mathcal{R}(H)$ is defined by 

$$P_H = H(H^*H)^{-1}H^*$$

For the regularized problem (32.1), we instead have 

$$\hat{y} = H(\Pi + H^*WH)^{-1}H^*Wy = HPH^*Wy$$

Although the matrix $HPH^*W$ that multiplies $y$ is not an actual projection matrix (cf. (29.19) 
and (29.20)), we shall still refer to $\hat{y}$ as the projection of $y$ onto $\mathcal{R}(H)$ for ease of reference. 
Note that in both cases, with and without regularization, $\hat{y} \in \mathcal{R}(H)$. 
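To make the distinction between the regularized projection and an actual projection concrete, the following NumPy sketch forms $P$, $\hat{w}$, and $\hat{y}$ for randomly generated real data and checks whether the matrix $HPH^*W$ is idempotent. The helper name `regularized_projection`, the problem sizes, and the particular choices of $\Pi$ and $W$ are assumptions made only for illustration.

```python
import numpy as np

def regularized_projection(H, y, Pi, W):
    """Return P, w_hat, y_hat of (32.2)-(32.4) for problem (32.1)."""
    Hh = H.conj().T
    P = np.linalg.inv(Pi + Hh @ W @ H)      # P = (Pi + H* W H)^{-1}
    w_hat = P @ Hh @ W @ y                  # w_hat = P H* W y
    y_hat = H @ w_hat                       # regularized projection of y
    return P, w_hat, y_hat

rng = np.random.default_rng(0)
N, M = 8, 3
H = rng.standard_normal((N, M))
y = rng.standard_normal(N)
W = np.diag(rng.uniform(0.5, 2.0, N))       # positive-definite weighting (assumed diagonal)
Pi = 0.5 * np.eye(M)                        # positive-definite regularization

P, w_hat, y_hat = regularized_projection(H, y, Pi, W)
T_reg = H @ P @ H.T @ W                     # maps y to y_hat when Pi > 0
P0, _, _ = regularized_projection(H, y, np.zeros((M, M)), W)
T_unreg = H @ P0 @ H.T @ W                  # same map with Pi = 0

is_projector_reg = np.allclose(T_reg @ T_reg, T_reg)
is_projector_unreg = np.allclose(T_unreg @ T_unreg, T_unreg)
```

Under these assumptions only the un-regularized map is idempotent, while in both cases $\hat{y}$ is a combination of the columns of $H$.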

The resulting residual vector is 

$$\tilde{y} = y - H\hat{w}$$

and the corresponding minimum cost is (cf. Table 29.3): 

$$\xi = y^*W\tilde{y} \qquad (32.5)$$

Introduce the scalar 

$$\gamma \triangleq 1 - uPu^* \qquad (32.6)$$

where $u$ denotes the last row of $H$. This scalar plays an important role in least-squares 
adaptive filtering, and it is called the conversion factor for reasons to be explained later in 
Sec. 30.4. We shall not employ $\gamma$ in any of the arguments below, but will only comment 
on some of its properties whenever appropriate. 

Now assume that we extend the data matrix $H$ in (32.1) by adding one column to it, say, 
$h$, and consider the extended least-squares problem: 

$$\min_{w_z} \left\{ w_z^* \begin{bmatrix} \Pi & 0 \\ 0 & \sigma \end{bmatrix} w_z + (y - [H\ \ h]\,w_z)^* W (y - [H\ \ h]\,w_z) \right\} \qquad (32.7)$$


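We say that the order of the estimation problem has increased by one since we are now 
estimating $y$ from the column span of the extended data matrix $[H\ \ h]$. Of course, the 
corresponding weight vector, $w_z$, has one dimension higher than $w$ and, accordingly, we 
also extend the regularization matrix, $\Pi$, by adding a positive scalar $\sigma$ to it.[15] 
The optimal solution of (32.7) is given by (again cf. Thm. 29.5): 

$$\hat{w}_z = \left( \begin{bmatrix} \Pi & 0 \\ 0 & \sigma \end{bmatrix} + \begin{bmatrix} H^* \\ h^* \end{bmatrix} W\, [H\ \ h] \right)^{-1} \begin{bmatrix} H^* \\ h^* \end{bmatrix} W y \qquad (32.8)$$

We denote the inverse of the extended coefficient matrix by 

$$P_z \triangleq \left( \begin{bmatrix} \Pi & 0 \\ 0 & \sigma \end{bmatrix} + \begin{bmatrix} H^* \\ h^* \end{bmatrix} W\, [H\ \ h] \right)^{-1} \qquad (32.9)$$

so that the new estimate of $y$ is given by 

$$\hat{y}_z = [H\ \ h]\,\hat{w}_z \qquad (32.10)$$

We say that $\hat{y}_z$ is the regularized projection of $y$ onto the range space of the extended data 
matrix $[H\ \ h]$. The resulting residual vector is 

$$\tilde{y}_z = y - [H\ \ h]\,\hat{w}_z$$

and the corresponding minimum cost is (cf. Table 29.3): 

$$\xi_z = y^*W\tilde{y}_z \qquad (32.11)$$

The associated conversion factor is 

$$\gamma_z = 1 - [u\ \ \alpha]\, P_z \begin{bmatrix} u^* \\ \alpha^* \end{bmatrix} \qquad (32.12)$$

where $[u\ \ \alpha]$ denotes the last row of $[H\ \ h]$. 


[15] In Prob. VII.18 we show that the results of this section will still apply, with minor modifications, even if the 
new regularization matrix were not merely a diagonal extension of $\Pi$ but, more generally, of the form 

$$\begin{bmatrix} \Pi & m^* \\ m & c \end{bmatrix}$$

for some row vector $m$ and positive scalar $c$. 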


Our objective is to examine the relations between the solutions of the least-squares prob- 
lems (32.1) and (32.7). We shall use both algebraic and geometric arguments for the sake 
of illustration. We start with the algebraic argument and later show how the geometry of 
least-squares theory can be used to arrive at the same conclusions. 


Algebraic Argument 
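The first step toward relating the solution vectors $\{\hat{w}, \hat{w}_z\}$ in (32.2) and (32.8) is to relate 
the coefficient matrices $\{P, P_z\}$ in (32.3) and (32.9). To achieve this, we start by observing 
from (32.9) that 

$$P_z^{-1} = \begin{bmatrix} \Pi & 0 \\ 0 & \sigma \end{bmatrix} + \begin{bmatrix} H^* \\ h^* \end{bmatrix} W\, [H\ \ h] = \begin{bmatrix} P^{-1} & H^*Wh \\ h^*WH & \sigma + h^*Wh \end{bmatrix} \qquad (32.13)$$

where the top leftmost corner entry of $P_z^{-1}$ is seen to be $P^{-1}$, 

$$P^{-1} = \Pi + H^*WH$$

In order to relate $P_z$ to $P$ we invoke the easily verifiable matrix identity: 

$$\begin{bmatrix} A & B \\ C & D \end{bmatrix}^{-1} = \begin{bmatrix} A^{-1} & 0 \\ 0 & 0 \end{bmatrix} + \begin{bmatrix} -A^{-1}B \\ I \end{bmatrix} \Delta^{-1} \begin{bmatrix} -CA^{-1} & I \end{bmatrix}, \qquad \Delta = D - CA^{-1}B \qquad (32.14)$$

which relates the inverse of a block matrix to the inverse of its top leftmost corner block. 
Applying this identity to (32.13) we obtain 

$$P_z = \begin{bmatrix} P & 0 \\ 0 & 0 \end{bmatrix} + \frac{1}{\nu} \begin{bmatrix} -PH^*Wh \\ 1 \end{bmatrix} \begin{bmatrix} -h^*WHP & 1 \end{bmatrix} \qquad (32.15)$$

where the scalar $\nu$ is given by 

$$\nu = \sigma + h^*Wh - h^*WHPH^*Wh = \sigma + h^*W[h - HPH^*Wh] \qquad (32.16)$$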
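A quick numerical check of (32.15) and (32.16) can be sketched as follows, comparing a direct inversion of the extended coefficient matrix with the rank-one update built from $P$ alone. The data, dimensions, and variable names (`Pz_direct`, `Pz_update`) are assumptions chosen only for illustration, and real-valued data are used so that $^*$ reduces to a transpose.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 8, 3
H = rng.standard_normal((N, M))
h = rng.standard_normal(N)
W = np.diag(rng.uniform(0.5, 2.0, N))
Pi = 0.3 * np.eye(M)
sigma = 0.7

# Direct inversion of the extended coefficient matrix, as in (32.9)
Hz = np.column_stack([H, h])
Pi_z = np.zeros((M + 1, M + 1))
Pi_z[:M, :M] = Pi
Pi_z[M, M] = sigma
Pz_direct = np.linalg.inv(Pi_z + Hz.T @ W @ Hz)

# Rank-one update (32.15) with the scalar nu of (32.16)
P = np.linalg.inv(Pi + H.T @ W @ H)
c = P @ H.T @ W @ h                          # the vector P H* W h
nu = sigma + h @ W @ h - h @ W @ H @ c
v = np.append(-c, 1.0)                       # col{-P H* W h, 1}
Pz_update = np.zeros((M + 1, M + 1))
Pz_update[:M, :M] = P
Pz_update += np.outer(v, v) / nu
```

The update costs one matrix-vector product with $P$ and an outer product, instead of a fresh $(M+1) \times (M+1)$ inversion.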


The quantity $PH^*Wh$ that appears in (32.15) and (32.16) can be interpreted as the weight 
vector that solves the following regularized least-squares problem: 

$$\min_{w^b} \left[ w^{b*}\Pi w^b + (h - Hw^b)^* W (h - Hw^b) \right] \qquad (32.17)$$

Indeed, the solution of (32.17) is 

$$\hat{w}^b = PH^*Wh$$

and it corresponds to the weight vector that projects (in a regularized manner) the column 
vector $h$ onto the column span of $H$. This projection problem is the reason for the title 
of this subsection, namely, backward projection. The terminology refers to the fact that 
we are projecting $h$ onto the preceding columns of $[H\ \ h]$, as indicated schematically in 
Fig. 32.1. 

FIGURE 32.1 Backward projection: The last column of $[H\ \ h]$ is projected onto the column span 
of the matrix $H$. 

We denote the resulting estimate and residual vectors of this projection problem by 

$$\hat{h} = H\hat{w}^b = HPH^*Wh, \qquad \tilde{h} = h - \hat{h} = h - H\hat{w}^b = h - HPH^*Wh$$

The last entry of $\tilde{h}$ is $\alpha - u\hat{w}^b$; it corresponds to the error in estimating $\alpha$ and we denote it 
by 

$$\tilde{\alpha} \triangleq \alpha - u\hat{w}^b \qquad (32.18)$$

Also, the minimum cost of (32.17) is given by (cf. Table 29.3): 

$$\xi_h = h^*W\tilde{h} \qquad (32.19)$$

Using the above definitions of $\{\hat{w}^b, \tilde{h}\}$ we can now rewrite (32.15) as 

$$P_z = \begin{bmatrix} P & 0 \\ 0 & 0 \end{bmatrix} + \frac{1}{\sigma + \xi_h} \begin{bmatrix} -\hat{w}^b \\ 1 \end{bmatrix} \begin{bmatrix} -\hat{w}^{b*} & 1 \end{bmatrix} \qquad (32.20)$$

which provides the desired relation between $\{P_z, P\}$; the relation is in terms of $\{\sigma, \xi_h, \hat{w}^b\}$ 
with the last two quantities arising from the backward projection problem (32.17). 
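If we multiply both sides of (32.20) from the right by $\begin{bmatrix} H^* \\ h^* \end{bmatrix} W y$ and introduce the scalar 

$$\kappa \triangleq \frac{\tilde{h}^*Wy}{\sigma + \xi_h} \qquad (32.21)$$

we find that the solutions $\{\hat{w}_z, \hat{w}\}$ are related via 

$$\hat{w}_z = \begin{bmatrix} \hat{w} \\ 0 \end{bmatrix} + \kappa \begin{bmatrix} -\hat{w}^b \\ 1 \end{bmatrix} \qquad (32.22)$$

Likewise, the projections $\{\hat{y}, \hat{y}_z\}$ are related via 

$$\hat{y}_z = [H\ \ h]\,\hat{w}_z = [H\ \ h] \left( \begin{bmatrix} \hat{w} \\ 0 \end{bmatrix} + \kappa \begin{bmatrix} -\hat{w}^b \\ 1 \end{bmatrix} \right) = H\hat{w} - \kappa H\hat{w}^b + \kappa h = \hat{y} + \kappa(h - \hat{h})$$

That is, 

$$\hat{y}_z = \hat{y} + \kappa\tilde{h} \qquad (32.23)$$

and, consequently, the residual vectors $\{\tilde{y}_z, \tilde{y}\}$ satisfy 

$$\tilde{y}_z = \tilde{y} - \kappa\tilde{h} \qquad (32.24)$$

It is also straightforward to see that the corresponding minimum costs $\{\xi, \xi_z\}$ satisfy 

$$\xi_z = y^*W\tilde{y}_z = y^*W[\tilde{y} - \kappa\tilde{h}] = y^*W\tilde{y} - \kappa\, y^*W\tilde{h}$$

i.e., using (32.21), 

$$\xi_z = \xi - \frac{|\rho|^2}{\sigma + \xi_h} \qquad (32.25)$$

where we are defining the scalar 

$$\rho \triangleq y^*W\tilde{h} \qquad (32.26)$$

Finally, if we multiply both sides of (32.20) by $[u\ \ \alpha]$ from the left and $\mathrm{col}\{u^*, \alpha^*\}$ from 
the right, and use the definitions of $\{\gamma_z, \gamma, \tilde{\alpha}\}$ in (32.6), (32.12), and (32.18), we find that 

$$\gamma_z = \gamma - \frac{|\tilde{\alpha}|^2}{\sigma + \xi_h} \qquad (32.27)$$

Before re-deriving and interpreting the above results by means of geometric arguments, 
we summarize the conclusions in the following statement. 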


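Lemma 32.1 (Backward order-updates) Consider the regularized least-squares 
problems (32.1) and (32.7), where the data matrix $H$ in the first problem is 
extended by one column to $[H\ \ h]$ in the second problem. The corresponding 
solutions $\{\hat{w}, \hat{w}_z\}$, coefficient matrices $\{P, P_z\}$, residuals $\{\tilde{y}, \tilde{y}_z\}$, minimum 
costs $\{\xi, \xi_z\}$, and conversion factors $\{\gamma, \gamma_z\}$ are related as follows. 


1. Let $\hat{w}^b$ be the weight vector that projects $h$ onto $H$, according to 
(32.17), i.e., $\hat{w}^b = PH^*Wh$, where $P = (\Pi + H^*WH)^{-1}$. Let $\tilde{h}$ 
denote the resulting residual vector, $\tilde{h} = h - H\hat{w}^b$, whose last entry is 
$\tilde{\alpha}$. Let also $\xi_h$ denote the corresponding minimum cost, $\xi_h = h^*W\tilde{h}$. 


2. Define the scalars $\rho = y^*W\tilde{h}$ and $\kappa = \rho^*/(\sigma + \xi_h)$. 

Then the following relations hold: 

$$\begin{aligned}
\hat{y}_z &= \hat{y} + \kappa\tilde{h} \\
\tilde{y}_z &= \tilde{y} - \kappa\tilde{h} \\
\xi_z &= \xi - |\rho|^2/(\sigma + \xi_h) \\
\gamma_z &= \gamma - |\tilde{\alpha}|^2/(\sigma + \xi_h) \\
\hat{w}_z &= \begin{bmatrix} \hat{w} \\ 0 \end{bmatrix} + \kappa \begin{bmatrix} -\hat{w}^b \\ 1 \end{bmatrix} \\
P_z &= \begin{bmatrix} P & 0 \\ 0 & 0 \end{bmatrix} + \frac{1}{\sigma + \xi_h} \begin{bmatrix} -\hat{w}^b \\ 1 \end{bmatrix} \begin{bmatrix} -\hat{w}^{b*} & 1 \end{bmatrix}
\end{aligned}$$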
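The relations of Lemma 32.1 can be exercised numerically. The sketch below is only an illustration under stated assumptions: the function name `backward_order_update`, the dictionary keys, and the random test data are hypothetical, and $W$ is assumed Hermitian so that $\rho^* = \tilde{h}^*Wy$. It computes the extended quantities from the solution of (32.1) and the backward projection (32.17), without inverting the extended coefficient matrix.

```python
import numpy as np

def backward_order_update(H, h, y, Pi, sigma, W):
    """Quantities of the extended problem (32.7), obtained via Lemma 32.1."""
    Hh = H.conj().T
    P = np.linalg.inv(Pi + Hh @ W @ H)
    w = P @ Hh @ W @ y                          # solution of (32.1)
    y_til = y - H @ w                           # residual of (32.1)
    xi = (y.conj() @ W @ y_til).real            # minimum cost (32.5)
    gamma = 1.0 - (H[-1] @ P @ H[-1].conj()).real   # conversion factor (32.6)

    wb = P @ Hh @ W @ h                         # backward projection weight (32.17)
    h_til = h - H @ wb                          # its residual; last entry is alpha_tilde
    xi_h = (h.conj() @ W @ h_til).real          # minimum cost (32.19)
    rho = y.conj() @ W @ h_til                  # rho = y* W h_tilde
    kappa = np.conj(rho) / (sigma + xi_h)

    v = np.append(-wb, 1.0)                     # col{-w^b, 1}
    Pz = np.zeros((len(v), len(v)), dtype=np.result_type(P, v))
    Pz[:-1, :-1] = P
    Pz += np.outer(v, v.conj()) / (sigma + xi_h)
    return {
        "w_z": np.append(w, 0.0) + kappa * v,
        "P_z": Pz,
        "y_til_z": y_til - kappa * h_til,
        "xi_z": xi - abs(rho) ** 2 / (sigma + xi_h),
        "gamma_z": gamma - abs(h_til[-1]) ** 2 / (sigma + xi_h),
    }

# Hypothetical real-valued data
rng = np.random.default_rng(2)
N, M = 10, 4
H = rng.standard_normal((N, M))
h = rng.standard_normal(N)
y = rng.standard_normal(N)
W = np.diag(rng.uniform(0.5, 2.0, N))
Pi = 0.2 * np.eye(M)
sigma = 0.4
out = backward_order_update(H, h, y, Pi, sigma, W)
```

The returned values can be compared against a direct solution of (32.7) with the data matrix `np.column_stack([H, h])`; under the assumptions above they should agree to numerical precision.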


Special Case: Absence of Regularization 


In order to illustrate how the presence of regularization affects the interpretation of the 
results, let us examine what happens when regularization is ignored. Thus, assume that 
$\Pi = 0$ and $\sigma = 0$, so that the backward projection problem (32.17) reduces to 

$$\min_{w^b}\ (h - Hw^b)^* W (h - Hw^b) \qquad (32.28)$$

Its solution $\hat{w}^b$ is now such that it satisfies the orthogonality condition 

$$H^*W(h - H\hat{w}^b) = 0 \qquad (32.29)$$

and, in this case, we get 

$$\hat{h}^*W\tilde{h} = 0$$

since $\hat{h} \in \mathcal{R}(H)$. This fact allows us to rewrite the minimum cost $\xi_h$ in (32.19) as 

$$\xi_h = h^*W\tilde{h} = [\hat{h} + \tilde{h}]^*W\tilde{h} = \tilde{h}^*W\tilde{h} \qquad (32.30)$$

and expression (32.21) for $\kappa$ becomes 

$$\kappa = \frac{\tilde{h}^*Wy}{\tilde{h}^*W\tilde{h}} \qquad (32.31)$$

This expression identifies $\kappa$ as the coefficient that projects $y$ onto $\tilde{h}$, i.e., it is the coefficient 
that solves the least-squares problem: 

$$\min_{k}\ (y - \tilde{h}k)^* W (y - \tilde{h}k) \implies \kappa$$

In other words, when $\Pi = 0$ and $\sigma = 0$, the term $\kappa\tilde{h}$ in (32.23) can be interpreted as the 
projection of $y$ onto $\tilde{h}$. The equality (32.31) does not hold in the regularized case and, 
therefore, we cannot interpret $\kappa$ in that case as the solution to the problem of projecting $y$ 
onto $\tilde{h}$. That is, in the regularized case (32.17), it does not hold that $\kappa$ solves 

$$\min_{k}\ \left[ \sigma|k|^2 + (y - \tilde{h}k)^* W (y - \tilde{h}k) \right]$$

since the solution to this problem would be $k = \tilde{h}^*Wy/(\sigma + \tilde{h}^*W\tilde{h})$, with the term 
$\tilde{h}^*W\tilde{h}$ appearing in the denominator instead of $h^*W\tilde{h}$, as in expression (32.21) for $\kappa$. 
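Actually, more can be said about $\kappa$ in (32.31). If we write 

$$y = \hat{y} + \tilde{y}$$

and recall that $\hat{y} \in \mathcal{R}(H)$, then by virtue of the orthogonality condition (32.29) we have 
$\tilde{h}^*W\hat{y} = 0$, so that 

$$\tilde{h}^*Wy = \tilde{h}^*W(\hat{y} + \tilde{y}) = \tilde{h}^*W\tilde{y}$$

and (32.31) can be replaced by 

$$\kappa = \frac{\tilde{h}^*W\tilde{y}}{\tilde{h}^*W\tilde{h}} \qquad (32.32)$$

That is, in the un-regularized case (32.28), we can also interpret $\kappa$ as the coefficient that 
projects $\tilde{y}$ onto $\tilde{h}$, i.e., it solves 

$$\min_{k}\ (\tilde{y} - \tilde{h}k)^* W (\tilde{y} - \tilde{h}k) \implies \kappa \qquad (32.33)$$

In view of this result, the difference $\tilde{y} - \kappa\tilde{h}$ in (32.24) can be interpreted as the residual 
vector that results from (32.33). 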

In conclusion, from (32.23)-(32.24) and from (32.31)-(32.32), we have that in the 
un-regularized case the new projection $\hat{y}_z$ can be obtained from the old projection $\hat{y}$ as follows: 


1. Project $h$ onto $\mathcal{R}(H)$ and find the residual vector $\tilde{h}$. 
2. Project $y$ onto $\tilde{h}$. 
3. Then $\hat{y}_z = \hat{y} + \hat{y}_{\tilde{h}}$, where $\hat{y}_{\tilde{h}}$ denotes the projection of $y$ onto $\tilde{h}$. 
4. Also, $\tilde{y}_z = \tilde{y} - \kappa\tilde{h}$. 


That is, we obtain the new projection, $\hat{y}_z$, by projecting $y$ separately onto $H$ and $\tilde{h}$ and 
adding the results. Alternatively, we can obtain the new residual vector $\tilde{y}_z$ of step 4) as 
follows: 


4.a) Find the residual vector $\tilde{y}$ that results from projecting $y$ onto $\mathcal{R}(H)$. 
4.b) Find the residual vector $\tilde{h}$ that results from projecting $h$ onto $\mathcal{R}(H)$. 
4.c) Then $\tilde{y}_z$ is the residual vector that results from projecting $\tilde{y}$ onto $\tilde{h}$. 


This construction of $\tilde{y}_z$ in the un-regularized case is depicted in Fig. 32.2. In the figure, the 
arrows indicate that $\tilde{y}$ results from projecting $y$ onto $\mathcal{R}(H)$ and $\tilde{h}$ results from projecting 
$h$ onto $\mathcal{R}(H)$. 


FIGURE 32.2 For un-regularized least-squares problems, the order-update of the residual vector 
$\tilde{y}$ into the residual vector $\tilde{y}_z$ (top figure) is obtained by projecting $\tilde{y}$ onto $\tilde{h}$ (bottom figure). 
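The two-step construction for the un-regularized case can be sketched as follows. The helper `wproject` and the random data are assumptions made for illustration; $H$ is assumed to have full column rank and $W$ to be positive-definite, so that the weighted projections are well defined.

```python
import numpy as np

def wproject(A, b, W):
    """Weighted least-squares projection of b onto range(A), without regularization."""
    A = A.reshape(len(b), -1)                  # also accept a single column
    Ah = A.conj().T
    return A @ np.linalg.solve(Ah @ W @ A, Ah @ W @ b)

rng = np.random.default_rng(3)
N, M = 9, 3
H = rng.standard_normal((N, M))
h = rng.standard_normal(N)
y = rng.standard_normal(N)
W = np.diag(rng.uniform(0.5, 2.0, N))

h_til = h - wproject(H, h, W)                  # step 1: residual of h after projecting onto R(H)
y_hat = wproject(H, y, W)                      # old projection
y_til = y - y_hat                              # old residual (step 4.a)
y_hat_z = y_hat + wproject(h_til, y, W)        # steps 2-3: add the projection of y onto h_tilde
y_til_z = y_til - wproject(h_til, y_til, W)    # step 4.c: residual of y_tilde after projecting onto h_tilde

# Reference: project y directly onto the range of the extended matrix [H h]
y_hat_z_direct = wproject(np.column_stack([H, h]), y, W)
```

Both routes give the same projection here because $\tilde{h}$ is $W$-orthogonal to the columns of $H$, which is exactly what fails once regularization is introduced.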


Geometric Argument 
We now re-derive the result of Lemma 32.1 by calling upon some of the geometric 
properties of least-squares problems. To begin with, the major difference between problems 
(32.1) and (32.7) relates to the data matrices $H$ and $[H\ \ h]$. In the first problem, we 
are projecting $y$ onto the column span of $H$ while in the second problem we are projecting $y$ 
onto the column span of $[H\ \ h]$; here we mean regularized projections. The column vector 
$h$ is the additional piece of information that distinguishes the two least-squares problems. 
Now assume that we replace $[H\ \ h]$ by a data matrix that spans the same column space 
in the following manner. Let $\tilde{h}$ denote the residual vector that results from the regularized 
projection of $h$ onto $\mathcal{R}(H)$, i.e., 

$$\tilde{h} = h - H\hat{w}^b$$

where, as before in (32.17), the weight vector $\hat{w}^b$ is the solution to 

$$\min_{w^b} \left[ w^{b*}\Pi w^b + (h - Hw^b)^* W (h - Hw^b) \right] \qquad (32.34)$$

and is given by 

$$\hat{w}^b = PH^*Wh$$

Then we can write 

$$[H\ \ h] = [H\ \ \tilde{h}] \begin{bmatrix} I & \hat{w}^b \\ 0 & 1 \end{bmatrix} \qquad (32.35)$$

which shows that the matrices $[H\ \ h]$ and $[H\ \ \tilde{h}]$ are related by an invertible transformation, 
so that they span the same column space, as desired. 

If we were dealing with un-regularized least-squares problems, i.e., if $\Pi = 0$ in (32.34), 
then clearly $\tilde{h} \perp \mathcal{R}(H)$, i.e., $H^*W\tilde{h} = 0$. For the general regularized least-squares 
problem (32.34), however, this orthogonality condition is replaced by (cf. Table 29.2): 

$$H^*W\tilde{h} = \Pi\hat{w}^b \qquad (32.36)$$

In either case, with and without regularization, the orthogonality condition and the 
transformation (32.35) allow us to re-establish the relations of Lemma 32.1, as we now verify. 


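Relating $\{P, P_z\}$. In view of (32.36), and using (32.35), we have 

$$\begin{aligned}
P_z^{-1} &= \begin{bmatrix} \Pi & 0 \\ 0 & \sigma \end{bmatrix} + \begin{bmatrix} H^* \\ h^* \end{bmatrix} W\, [H\ \ h] \\
&= \begin{bmatrix} \Pi & 0 \\ 0 & \sigma \end{bmatrix} + \begin{bmatrix} I & 0 \\ \hat{w}^{b*} & 1 \end{bmatrix} \begin{bmatrix} H^* \\ \tilde{h}^* \end{bmatrix} W\, [H\ \ \tilde{h}] \begin{bmatrix} I & \hat{w}^b \\ 0 & 1 \end{bmatrix} \\
&= \begin{bmatrix} \Pi & 0 \\ 0 & \sigma \end{bmatrix} + \begin{bmatrix} I & 0 \\ \hat{w}^{b*} & 1 \end{bmatrix} \begin{bmatrix} H^*WH & H^*W\tilde{h} \\ \tilde{h}^*WH & \tilde{h}^*W\tilde{h} \end{bmatrix} \begin{bmatrix} I & \hat{w}^b \\ 0 & 1 \end{bmatrix} \\
&= \begin{bmatrix} I & 0 \\ \hat{w}^{b*} & 1 \end{bmatrix} \left( \bar{\Pi} + \begin{bmatrix} H^*WH & H^*W\tilde{h} \\ \tilde{h}^*WH & \tilde{h}^*W\tilde{h} \end{bmatrix} \right) \begin{bmatrix} I & \hat{w}^b \\ 0 & 1 \end{bmatrix}
\end{aligned}$$

where 

$$\bar{\Pi} \triangleq \begin{bmatrix} I & 0 \\ -\hat{w}^{b*} & 1 \end{bmatrix} \begin{bmatrix} \Pi & 0 \\ 0 & \sigma \end{bmatrix} \begin{bmatrix} I & -\hat{w}^b \\ 0 & 1 \end{bmatrix} \qquad (32.37)$$

Therefore, using 

$$\bar{\Pi} = \begin{bmatrix} \Pi & -\Pi\hat{w}^b \\ -\hat{w}^{b*}\Pi & \sigma + \hat{w}^{b*}\Pi\hat{w}^b \end{bmatrix}$$

and (32.36), we get 

$$P_z^{-1} = \begin{bmatrix} I & 0 \\ \hat{w}^{b*} & 1 \end{bmatrix} \begin{bmatrix} P^{-1} & 0 \\ 0 & \sigma + h^*W\tilde{h} \end{bmatrix} \begin{bmatrix} I & \hat{w}^b \\ 0 & 1 \end{bmatrix} \qquad (32.38)$$

where we used the fact that 

$$\begin{aligned}
\sigma + \hat{w}^{b*}\Pi\hat{w}^b + \tilde{h}^*W\tilde{h} &= \sigma + \hat{w}^{b*}\Pi\hat{w}^b + (h - H\hat{w}^b)^*W\tilde{h} \\
&= \sigma + h^*W\tilde{h} + \underbrace{\hat{w}^{b*}(\Pi\hat{w}^b - H^*W\tilde{h})}_{=0 \text{ by } (32.36)} \\
&= \sigma + h^*W\tilde{h}
\end{aligned}$$

It follows from (32.38) that 

$$\begin{aligned}
P_z &= \begin{bmatrix} I & -\hat{w}^b \\ 0 & 1 \end{bmatrix} \begin{bmatrix} P & 0 \\ 0 & \dfrac{1}{\sigma + h^*W\tilde{h}} \end{bmatrix} \begin{bmatrix} I & 0 \\ -\hat{w}^{b*} & 1 \end{bmatrix} \\
&= \begin{bmatrix} P & 0 \\ 0 & 0 \end{bmatrix} + \frac{1}{\sigma + h^*W\tilde{h}} \begin{bmatrix} -\hat{w}^b \\ 1 \end{bmatrix} \begin{bmatrix} -\hat{w}^{b*} & 1 \end{bmatrix}
\end{aligned}$$

which coincides with expression (32.20). Observe that this argument avoids the need for 
the algebraic identity (32.14). Moreover, the argument leading to (32.38) becomes more 
immediate if regularization were not present. Indeed, note that when $\Pi = 0$ and $\sigma = 0$, 
and using 

$$P_z^{-1} = \begin{bmatrix} H^* \\ h^* \end{bmatrix} W\, [H\ \ h], \qquad P^{-1} = H^*WH, \qquad h = \tilde{h} + H\hat{w}^b, \qquad H^*W\tilde{h} = 0$$

relation (32.38) would immediately follow in the following form 

$$P_z^{-1} = \begin{bmatrix} I & 0 \\ \hat{w}^{b*} & 1 \end{bmatrix} \begin{bmatrix} P^{-1} & 0 \\ 0 & \tilde{h}^*W\tilde{h} \end{bmatrix} \begin{bmatrix} I & \hat{w}^b \\ 0 & 1 \end{bmatrix}$$

The argument that led to (32.38) assumes regularization. An alternative argument appears 
in Prob. VII.13, which is based on reformulating a regularized least-squares problem as a 
standard least-squares problem. 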
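The regularized orthogonality condition (32.36) and the factorization (32.38) lend themselves to a direct numerical check. The following sketch uses hypothetical real-valued data; the variable names (`lhs_3236`, `Pz_inv_factored`, and so on) are assumptions introduced only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
N, M = 8, 3
H = rng.standard_normal((N, M))
h = rng.standard_normal(N)
W = np.diag(rng.uniform(0.5, 2.0, N))
Pi = 0.3 * np.eye(M)
sigma = 0.6

P_inv = Pi + H.T @ W @ H                     # P^{-1}
wb = np.linalg.solve(P_inv, H.T @ W @ h)     # w^b = P H* W h
h_til = h - H @ wb                           # regularized residual of h

# Condition (32.36): H* W h_tilde = Pi w^b (it is not zero when Pi != 0)
lhs_3236 = H.T @ W @ h_til
rhs_3236 = Pi @ wb

# Factorization (32.38): P_z^{-1} = T* diag{P^{-1}, sigma + h* W h_tilde} T
T = np.eye(M + 1)
T[:M, M] = wb                                # T = [I w^b; 0 1]
D = np.zeros((M + 1, M + 1))
D[:M, :M] = P_inv
D[M, M] = sigma + h @ W @ h_til
Pz_inv_factored = T.T @ D @ T

Hz = np.column_stack([H, h])
Pi_z = np.zeros((M + 1, M + 1))
Pi_z[:M, :M] = Pi
Pi_z[M, M] = sigma
Pz_inv_direct = Pi_z + Hz.T @ W @ Hz         # definition (32.9), before inversion
```

Because $T$ is triangular with unit diagonal, inverting the factored form only requires $P$ and the scalar $\sigma + h^*W\tilde{h}$.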


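Relating $\{\hat{w}_z, \hat{w}\}$. Recall that $\hat{w}_z$ is the solution to the extended problem (32.7), namely, 

$$\min_{w_z} \left\{ w_z^* \begin{bmatrix} \Pi & 0 \\ 0 & \sigma \end{bmatrix} w_z + \left\| y - [H\ \ h]\,w_z \right\|_W^2 \right\}$$

Replacing the data matrix $[H\ \ h]$ by (32.35), the above problem becomes 

$$\min_{w_z} \left\{ w_z^* \begin{bmatrix} \Pi & 0 \\ 0 & \sigma \end{bmatrix} w_z + \left\| y - [H\ \ \tilde{h}] \begin{bmatrix} I & \hat{w}^b \\ 0 & 1 \end{bmatrix} w_z \right\|_W^2 \right\}$$

This alternative rewriting of the cost function suggests that we introduce 

$$\bar{w}_z \triangleq \begin{bmatrix} I & \hat{w}^b \\ 0 & 1 \end{bmatrix} w_z$$

and solve instead 

$$\min_{\bar{w}_z} \left\{ \bar{w}_z^*\,\bar{\Pi}\,\bar{w}_z + \left\| y - [H\ \ \tilde{h}]\,\bar{w}_z \right\|_W^2 \right\}$$

where $\bar{\Pi}$ is as in (32.37). The solution $\hat{\bar{w}}_z$ is given by 

$$\hat{\bar{w}}_z = \left( \bar{\Pi} + \begin{bmatrix} H^* \\ \tilde{h}^* \end{bmatrix} W\, [H\ \ \tilde{h}] \right)^{-1} \begin{bmatrix} H^* \\ \tilde{h}^* \end{bmatrix} W y$$

or, equivalently, using (32.38), 

$$\hat{\bar{w}}_z = \begin{bmatrix} P & 0 \\ 0 & \dfrac{1}{\sigma + h^*W\tilde{h}} \end{bmatrix} \begin{bmatrix} H^*Wy \\ \tilde{h}^*Wy \end{bmatrix} = \begin{bmatrix} \hat{w} \\ \kappa \end{bmatrix}$$

Then, 

$$\hat{w}_z = \begin{bmatrix} I & -\hat{w}^b \\ 0 & 1 \end{bmatrix} \hat{\bar{w}}_z = \begin{bmatrix} \hat{w} - \kappa\hat{w}^b \\ \kappa \end{bmatrix}$$

which is the same expression we derived before for $\hat{w}_z$ in terms of $\hat{w}$ in (32.22); see 
Prob. VII.14 for an alternative derivation that is based on reformulating a regularized 
least-squares problem as a standard least-squares problem. 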


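Relating $\{\hat{y}_z, \hat{y}\}$. The projection $\hat{y}_z$ is given by 

$$\hat{y}_z = [H\ \ h]\,\hat{w}_z = H\hat{w} + \kappa\tilde{h}$$

which is again the same relation we derived before in (32.23). 


Relating $\{\gamma_z, \gamma\}$. From the definition (32.12) for $\gamma_z$ we have: 

$$\begin{aligned}
\gamma_z &= 1 - [u\ \ \alpha]\, P_z \begin{bmatrix} u^* \\ \alpha^* \end{bmatrix} \\
&\overset{(32.38)}{=} 1 - [u\ \ \alpha] \begin{bmatrix} I & -\hat{w}^b \\ 0 & 1 \end{bmatrix} \begin{bmatrix} P & 0 \\ 0 & \dfrac{1}{\sigma + h^*W\tilde{h}} \end{bmatrix} \begin{bmatrix} I & 0 \\ -\hat{w}^{b*} & 1 \end{bmatrix} \begin{bmatrix} u^* \\ \alpha^* \end{bmatrix} \\
&= 1 - [u\ \ \tilde{\alpha}] \begin{bmatrix} P & 0 \\ 0 & \dfrac{1}{\sigma + h^*W\tilde{h}} \end{bmatrix} \begin{bmatrix} u^* \\ \tilde{\alpha}^* \end{bmatrix} \\
&= 1 - uPu^* - \frac{|\tilde{\alpha}|^2}{\sigma + h^*W\tilde{h}} \\
&= \gamma - \frac{|\tilde{\alpha}|^2}{\sigma + h^*W\tilde{h}}
\end{aligned}$$

where $\{\alpha, \tilde{\alpha}\}$ are the last entries of $\{h, \tilde{h}\}$. This expression coincides with (32.27) since 
$\xi_h = h^*W\tilde{h}$. 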


32.2 FORWARD ORDER-UPDATE RELATIONS 


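In Sec. 32.1 we started from the least-squares problem (32.1) and extended the data matrix 
$H$ by adding a column to its right, as in (32.7). Then in Lemma 32.1 we related the 
solutions to both problems. We can similarly consider the problem of extending $H$ by 
adding a column to its left (rather than to its right), and by extending $\Pi$ accordingly, say, 

$$\min_{w_z} \left\{ w_z^* \begin{bmatrix} \sigma & 0 \\ 0 & \Pi \end{bmatrix} w_z + (y - [h\ \ H]\,w_z)^* W (y - [h\ \ H]\,w_z) \right\} \qquad (32.39)$$

The optimal solution of this extended problem is now given by 

$$\hat{w}_z = \left( \begin{bmatrix} \sigma & 0 \\ 0 & \Pi \end{bmatrix} + \begin{bmatrix} h^* \\ H^* \end{bmatrix} W\, [h\ \ H] \right)^{-1} \begin{bmatrix} h^* \\ H^* \end{bmatrix} W y$$

We again denote the inverse of its extended coefficient matrix by 

$$P_z = \left( \begin{bmatrix} \sigma & 0 \\ 0 & \Pi \end{bmatrix} + \begin{bmatrix} h^* \\ H^* \end{bmatrix} W\, [h\ \ H] \right)^{-1} \qquad (32.40)$$

so that 

$$\hat{w}_z = P_z \begin{bmatrix} h^* \\ H^* \end{bmatrix} W y$$

The resulting residual vector is 

$$\tilde{y}_z = y - [h\ \ H]\,\hat{w}_z$$

with the minimum cost of (32.39) equal to (cf. Table 29.3): 

$$\xi_z = y^*W\tilde{y}_z$$

We further associate with (32.39) the conversion factor 

$$\gamma_z = 1 - [\alpha\ \ u]\, P_z \begin{bmatrix} \alpha^* \\ u^* \end{bmatrix}$$

where $[\alpha\ \ u]$ denotes the last row of $[h\ \ H]$. 

We can again relate the quantities $\{P, P_z\}$, $\{\hat{w}, \hat{w}_z\}$, $\{\tilde{y}, \tilde{y}_z\}$, and $\{\gamma, \gamma_z\}$ of problems 
(32.1) and (32.39). The arguments (both algebraic and geometric) are identical to those 
in Sec. 32.1. For this reason, we only state the final results here and leave the details to 
Probs. VII.15-VII.17. 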

The relation between problems (32.1) and (32.39) is determined in terms of the solution 
to the following regularized least-squares problem, 

$$\min_{w^f} \left[ w^{f*}\Pi w^f + (h - Hw^f)^* W (h - Hw^f) \right] \qquad (32.41)$$

namely, 

$$\hat{w}^f = PH^*Wh$$

This weight vector performs the (regularized) projection of $h$ onto the column span of $H$. 
And this projection problem is the reason for the title of this subsection, namely, forward 
projection. The terminology refers to the fact that we are projecting $h$ onto the posterior 
columns of $[h\ \ H]$ (see Fig. 32.3). The minimum cost of (32.41) is given by 

$$\xi_h = h^*W\tilde{h} \qquad (32.42)$$

where 

$$\tilde{h} = h - \hat{h} = h - H\hat{w}^f = h - HPH^*Wh$$


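Lemma 32.2 (Forward order-updates) Consider the regularized problems 
(32.1) and (32.39), where the data matrix $H$ in the first problem is extended 
by one column to $[h\ \ H]$ in the second problem. The corresponding solutions 
$\{\hat{w}, \hat{w}_z\}$, coefficient matrices $\{P, P_z\}$, residuals $\{\tilde{y}, \tilde{y}_z\}$, minimum costs 
$\{\xi, \xi_z\}$, and conversion factors $\{\gamma, \gamma_z\}$ are related as follows. 


1. Let $\hat{w}^f$ be the weight vector that projects $h$ onto $H$, according to 
(32.41), i.e., $\hat{w}^f = PH^*Wh$, where $P = (\Pi + H^*WH)^{-1}$. Let $\tilde{h}$ 
denote the resulting residual vector, $\tilde{h} = h - H\hat{w}^f$, whose last entry is 
$\tilde{\alpha}$. Let also $\xi_h$ denote the corresponding minimum cost, $\xi_h = h^*W\tilde{h}$. 


2. Define the scalars $\rho = y^*W\tilde{h}$ and $\kappa = \rho^*/(\sigma + \xi_h)$. 

Then the following relations hold: 

$$\begin{aligned}
\hat{y}_z &= \hat{y} + \kappa\tilde{h} \\
\tilde{y}_z &= \tilde{y} - \kappa\tilde{h} \\
\xi_z &= \xi - |\rho|^2/(\sigma + \xi_h) \\
\gamma_z &= \gamma - |\tilde{\alpha}|^2/(\sigma + \xi_h) \\
\hat{w}_z &= \begin{bmatrix} 0 \\ \hat{w} \end{bmatrix} + \kappa \begin{bmatrix} 1 \\ -\hat{w}^f \end{bmatrix} \\
P_z &= \begin{bmatrix} 0 & 0 \\ 0 & P \end{bmatrix} + \frac{1}{\sigma + \xi_h} \begin{bmatrix} 1 \\ -\hat{w}^f \end{bmatrix} \begin{bmatrix} 1 & -\hat{w}^{f*} \end{bmatrix}
\end{aligned}$$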

Special Case: Absence of Regularization 
Let us examine again what happens when $\Pi = 0$ and $\sigma = 0$ in problem (32.39), so that 
the forward projection problem (32.41) reduces to 

$$\min_{w^f}\ (h - Hw^f)^* W (h - Hw^f) \qquad (32.43)$$

Its solution is such that it should satisfy the orthogonality condition $H^*W(h - H\hat{w}^f) = 0$. 
Therefore, in this case, we also have $\hat{h}^*W\tilde{h} = 0$ since $\hat{h} \in \mathcal{R}(H)$. This fact allows us to 
rewrite the minimum cost $\xi_h = h^*W\tilde{h}$ in (32.42) as 

$$\xi_h = h^*W\tilde{h} = [\hat{h} + \tilde{h}]^*W\tilde{h} = \tilde{h}^*W\tilde{h} \qquad (32.44)$$

so that the expression for $\kappa$ in Lemma 32.2 becomes 

$$\kappa = \frac{\tilde{h}^*Wy}{\tilde{h}^*W\tilde{h}} \qquad (32.45)$$

As explained in the case of backward projection following Lemma 32.1, the above 
expression allows us to identify $\kappa$ as the coefficient that projects $y$ onto $\tilde{h}$, i.e., it solves the 
least-squares problem 

$$\min_{k}\ (y - \tilde{h}k)^* W (y - \tilde{h}k) \implies \kappa$$

In this way, the term $\kappa\tilde{h}$ in the expression for $\hat{y}_z$ in Lemma 32.2 can be interpreted as 
the projection of $y$ onto $\tilde{h}$. As explained before, the equality (32.44) does not hold in the 
regularized case (32.39). 

FIGURE 32.3 Forward projection: The leading column of $[h\ \ H]$ is projected onto the column 
span of $H$ (top figure). This step helps relate the order-update problems of projecting $y$ onto the 
column spans of $H$ and $[h\ \ H]$ (bottom figure). 


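More can be said about $\kappa$ in (32.45). If we write 

$$y = \hat{y} + \tilde{y}$$

and recall that $\hat{y} \in \mathcal{R}(H)$, then since $\tilde{h}^*W\hat{y} = 0$ by the orthogonality condition of 
weighted least-squares solutions, we get 

$$\tilde{h}^*Wy = \tilde{h}^*W(\hat{y} + \tilde{y}) = \tilde{h}^*W\tilde{y}$$

and (32.45) can be replaced by 

$$\kappa = \frac{\tilde{h}^*W\tilde{y}}{\tilde{h}^*W\tilde{h}} \qquad (32.46)$$

That is, in the un-regularized case (32.43), we can also interpret $\kappa$ as the coefficient that 
projects $\tilde{y}$ onto $\tilde{h}$, i.e., it solves 

$$\min_{k}\ (\tilde{y} - \tilde{h}k)^* W (\tilde{y} - \tilde{h}k) \implies \kappa \qquad (32.47)$$

In view of this result, the difference $\tilde{y} - \kappa\tilde{h}$ in the expression for $\tilde{y}_z$ in Lemma 32.2 can be 
interpreted as the residual vector that results from (32.47). 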

In conclusion, we find that in the un-regularized case the new projection $\hat{y}_z$ can be 
obtained from the old projection $\hat{y}$ as follows: 


1. Project $h$ onto $\mathcal{R}(H)$ and find the residual vector $\tilde{h}$. 
2. Project $y$ onto $\tilde{h}$. 
3. Then $\hat{y}_z = \hat{y} + \hat{y}_{\tilde{h}}$, where $\hat{y}_{\tilde{h}}$ denotes the projection of $y$ onto $\tilde{h}$. 
4. Also, $\tilde{y}_z = \tilde{y} - \kappa\tilde{h}$. 


That is, we obtain the new projection, $\hat{y}_z$, by projecting $y$ separately onto $H$ and $\tilde{h}$ and 
adding the results. Alternatively, we can obtain the new residual vector $\tilde{y}_z$ of step 4) as 
follows: 


4.a) Find the residual vector $\tilde{y}$ that results from projecting $y$ onto $\mathcal{R}(H)$. 
4.b) Find the residual vector $\tilde{h}$ that results from projecting $h$ onto $\mathcal{R}(H)$. 
4.c) Then $\tilde{y}_z$ is the residual vector that results from projecting $\tilde{y}$ onto $\tilde{h}$. 


This construction of $\tilde{y}_z$ in the un-regularized case is depicted in Fig. 32.4. In the figure, the 
arrows indicate that $\tilde{y}$ results from projecting $y$ onto $\mathcal{R}(H)$ and $\tilde{h}$ results from projecting 
$h$ onto $\mathcal{R}(H)$. 


FIGURE 32.4 For un-regularized least-squares problems, the order-update of the residual vector 
$\tilde{y}$ into the residual vector $\tilde{y}_z$ is obtained by projecting $\tilde{y}$ onto $\tilde{h}$. 


32.3 TIME-UPDATE RELATION 


The discussion in this section complements the one presented in Secs. 32.1 and 32.2. The 
results derived here will only be used later in Part X (Lattice Filters), when we study order- 
recursive adaptive filters. 

Thus, consider an $(N-1) \times M$ data matrix $H_{N-1}$ and partition it as 

$$H_{N-1} \triangleq [\,x_{N-1}\ \ \bar{H}_{N-1}\ \ z_{N-1}\,] \qquad (32.48)$$

where $x_{N-1}$ and $z_{N-1}$ denote the leading and trailing columns of $H_{N-1}$, and $\bar{H}_{N-1}$ 
denotes its middle columns. Let $\hat{z}_{N-1}$ denote the (regularized) projection of $z_{N-1}$ onto 
$\mathcal{R}(\bar{H}_{N-1})$, i.e., 

$$\hat{z}_{N-1} = \bar{H}_{N-1} w_{N-1,z}$$

where $w_{N-1,z}$ is obtained by solving the regularized least-squares problem 

$$\min_{w} \left[ \lambda^{N} w^*\Pi w + (z_{N-1} - \bar{H}_{N-1}w)^*\Lambda_{N-1}(z_{N-1} - \bar{H}_{N-1}w) \right] \qquad (32.49)$$

and $\Lambda_{N-1}$ is defined as in (30.25). Let further $\tilde{z}_{N-1}$ denote the resulting residual vector, 

$$\tilde{z}_{N-1} = z_{N-1} - \hat{z}_{N-1} = z_{N-1} - \bar{H}_{N-1} w_{N-1,z}$$

and define the weighted inner product 

$$\Delta(N-1) \triangleq x_{N-1}^*\Lambda_{N-1}\tilde{z}_{N-1}$$

In other words, $\Delta(N-1)$ is the weighted inner product between the first column of $H_{N-1}$ 
and the residual $\tilde{z}_{N-1}$ that results from projecting its last column onto the middle columns 
$\bar{H}_{N-1}$ (see Fig. 32.5). 

Now assume that one more row is appended to $H_{N-1}$ in (32.48), say, 

$$H_N = \begin{bmatrix} x_{N-1} & \bar{H}_{N-1} & z_{N-1} \\ \alpha(N) & h_N & \beta(N) \end{bmatrix} \triangleq [\,x_N\ \ \bar{H}_N\ \ z_N\,] \qquad (32.50)$$

where $\alpha(N)$ and $\beta(N)$ are scalars, while $h_N$ is a row vector. That is, $H_{N-1}$ is 
time-updated to $H_N$. 
As above, let $\hat{z}_N$ denote the (regularized) projection of $z_N$ onto $\mathcal{R}(\bar{H}_N)$, i.e., 

$$\hat{z}_N = \bar{H}_N w_{N,z}$$

where $w_{N,z}$ is obtained by solving the regularized least-squares problem 

$$\min_{w} \left[ \lambda^{N+1} w^*\Pi w + (z_N - \bar{H}_N w)^*\Lambda_N (z_N - \bar{H}_N w) \right] \qquad (32.51)$$

FIGURE 32.5 Inner product between the residual vector $\tilde{z}_{N-1}$ and $x_{N-1}$. 

Let further $\tilde{z}_N$ denote the resulting residual vector, 

$$\tilde{z}_N = z_N - \hat{z}_N = z_N - \bar{H}_N w_{N,z}$$

and define the corresponding weighted inner product 

$$\Delta(N) \triangleq x_N^*\Lambda_N\tilde{z}_N \qquad (32.52)$$

Again, $\Delta(N)$ is the weighted inner product between the first column of $H_N$ and the 
residual vector that results from projecting its last column onto the middle columns $\bar{H}_N$ (see 
Fig. 32.6). 

We would like to relate $\Delta(N)$ and $\Delta(N-1)$, i.e., we would like to determine a 
time-update relation for the variable $\Delta$. For this purpose, let $\hat{x}_N$ denote the (regularized) 
projection of $x_N$ onto $\mathcal{R}(\bar{H}_N)$: 

$$\hat{x}_N = \bar{H}_N w_{N,x}$$

where $w_{N,x}$ is obtained by solving the regularized least-squares problem 

$$\min_{w} \left[ \lambda^{N+1} w^*\Pi w + (x_N - \bar{H}_N w)^*\Lambda_N (x_N - \bar{H}_N w) \right] \qquad (32.53)$$

Introduce the estimation errors 

$$\tilde{\alpha}(N) = \alpha(N) - h_N w_{N,x}, \qquad \tilde{\beta}(N) = \beta(N) - h_N w_{N,z} \qquad (32.54)$$

Here, $\tilde{\alpha}(N)$ is the a posteriori error in estimating the last entry of $x_N$, while $\tilde{\beta}(N)$ is the a 
posteriori error in estimating the last entry of $z_N$. 

FIGURE 32.6 Inner product between the residual vector $\tilde{z}_N$ and $x_N$. 

Define further the conversion factor 

$$\gamma(N) \triangleq 1 - h_N P_N h_N^* \qquad (32.55)$$

where 

$$P_N = \left[ \lambda^{N+1}\Pi + \bar{H}_N^*\Lambda_N\bar{H}_N \right]^{-1}$$

Clearly, $\gamma(N)$ is the factor that relates the a posteriori error $\tilde{\beta}(N)$ to its a priori version, 
which is defined by 

$$\tilde{\beta}_a(N) = \beta(N) - h_N w_{N-1,z}$$

with $w_{N-1,z}$ used instead of $w_{N,z}$. That is, 

$$\tilde{\beta}(N) = \gamma(N)\,\tilde{\beta}_a(N)$$

Actually, $\gamma(N)$ is also the factor that relates the a posteriori error $\tilde{\alpha}(N)$ to its a priori 
version, which is defined as 

$$\tilde{\alpha}_a(N) = \alpha(N) - h_N w_{N-1,x}$$

with $w_{N-1,x}$ used instead of $w_{N,x}$, and where $w_{N-1,x}$ is the solution to a problem similar 
to (32.49) with $z_{N-1}$ replaced by $x_{N-1}$. That is, 

$$\tilde{\alpha}(N) = \gamma(N)\,\tilde{\alpha}_a(N)$$

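Now from the definition (32.52) for $\Delta(N)$, and using the fact that 

$$\Lambda_N = \begin{bmatrix} \lambda\Lambda_{N-1} & 0 \\ 0 & 1 \end{bmatrix}$$

we obtain 

$$\begin{aligned}
\Delta(N) &= \begin{bmatrix} x_{N-1}^* & \alpha^*(N) \end{bmatrix} \begin{bmatrix} \lambda\Lambda_{N-1} & 0 \\ 0 & 1 \end{bmatrix} \left( \begin{bmatrix} z_{N-1} \\ \beta(N) \end{bmatrix} - \begin{bmatrix} \bar{H}_{N-1} \\ h_N \end{bmatrix} w_{N,z} \right) \\
&= \lambda x_{N-1}^*\Lambda_{N-1}z_{N-1} + \alpha^*(N)\beta(N) - \left[ \lambda x_{N-1}^*\Lambda_{N-1}\bar{H}_{N-1} + \alpha^*(N)h_N \right] w_{N,z} \\
&= \lambda x_{N-1}^*\Lambda_{N-1}z_{N-1} + \alpha^*(N)\tilde{\beta}(N) - \lambda x_{N-1}^*\Lambda_{N-1}\bar{H}_{N-1}w_{N,z}
\end{aligned}$$

Moreover, the RLS recursion (30.13) allows us to relate $w_{N,z}$ and $w_{N-1,z}$ as 

$$w_{N,z} = w_{N-1,z} + P_N h_N^*\,[\tilde{\beta}(N)/\gamma(N)]$$

Substituting into the expression for $\Delta(N)$, and using 

$$\gamma^{-1}(N) P_N h_N^* = \lambda^{-1}P_{N-1}h_N^*$$

we obtain, after grouping terms, 

$$\Delta(N) = \lambda\Delta(N-1) + \tilde{\alpha}^*(N)\tilde{\beta}(N)/\gamma(N) \qquad (32.56)$$

This is a useful relation and it plays an important role in the derivation of order-recursive 
algorithms (see Part X (Lattice Filters)). The relation holds for generic regularized 
least-squares problems and there are no structural restrictions imposed on the data matrices 
$\{H_{N-1}, H_N\}$. 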
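Lemma 32.3 (Inner product time-update) Consider the data matrix (32.50), 
namely, 

$$H_N = \begin{bmatrix} x_{N-1} & \bar{H}_{N-1} & z_{N-1} \\ \alpha(N) & h_N & \beta(N) \end{bmatrix}$$

Let $\tilde{z}_{N-1}$ and $\tilde{z}_N$ denote the residual vectors that result from projecting 
$z_{N-1}$ onto $\mathcal{R}(\bar{H}_{N-1})$ and $z_N$ onto $\mathcal{R}(\bar{H}_N)$; both projections are meant in 
the regularized least-squares senses (32.49) and (32.51). Then the weighted 
inner products 

$$\Delta(N) = x_N^*\Lambda_N\tilde{z}_N, \qquad \Delta(N-1) = x_{N-1}^*\Lambda_{N-1}\tilde{z}_{N-1}$$

are related via the time-update relation 

$$\Delta(N) = \lambda\Delta(N-1) + \tilde{\alpha}^*(N)\tilde{\beta}(N)/\gamma(N)$$

where $\{\tilde{\alpha}(N), \tilde{\beta}(N)\}$ are the a posteriori errors in estimating the entries 
$\{\alpha(N), \beta(N)\}$, as defined by (32.54), and $\gamma(N)$ is the conversion factor 
defined by (32.55). 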
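The time-update of Lemma 32.3 can be checked numerically by computing both inner products from their definitions. In the sketch below the helper `inner_product_delta`, the random data, and the convention that a matrix with $n$ rows carries the time indices $0, \ldots, n-1$ (so that its regularization term is $\lambda^{n}\Pi$) are assumptions made for illustration.

```python
import numpy as np

def inner_product_delta(X, lam, Pi):
    """For X = [x, H_bar, z], return Delta = x* Lambda z_tilde, w_x, w_z and P."""
    n = X.shape[0]                                   # rows carry time indices 0..n-1
    x, Hb, z = X[:, 0], X[:, 1:-1], X[:, -1]
    Lam = np.diag(lam ** np.arange(n - 1, -1, -1))   # diag{lam^{n-1}, ..., lam, 1}
    Hbh = Hb.conj().T
    P = np.linalg.inv(lam ** n * Pi + Hbh @ Lam @ Hb)
    w_x = P @ Hbh @ Lam @ x                          # regularized projection of x
    w_z = P @ Hbh @ Lam @ z                          # regularized projection of z
    delta = x.conj() @ Lam @ (z - Hb @ w_z)
    return delta, w_x, w_z, P

# Hypothetical real-valued data: the last row of X plays the role of [alpha(N), h_N, beta(N)]
rng = np.random.default_rng(6)
n, M, lam = 12, 5, 0.9
X = rng.standard_normal((n, M))
Pi = 0.1 * np.eye(M - 2)

delta_N, w_x, w_z, P_N = inner_product_delta(X, lam, Pi)
delta_prev, _, _, _ = inner_product_delta(X[:-1], lam, Pi)

alpha, h_N, beta = X[-1, 0], X[-1, 1:-1], X[-1, -1]
alpha_til = alpha - h_N @ w_x                        # a posteriori errors (32.54)
beta_til = beta - h_N @ w_z
gamma = 1.0 - h_N @ P_N @ h_N                        # conversion factor (32.55)
delta_update = lam * delta_prev + np.conj(alpha_til) * beta_til / gamma
```

No shift structure is imposed on the rows of `X`, in line with the remark that the relation holds for generic data matrices.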


Summary and Notes 


The chapters in this part describe several basic concepts of least-squares and recursive 
least-squares theory. The main ideas and results are the following. 


SUMMARY OF MAIN RESULTS 


Least-Squares 
1. The standard least-squares criterion seeks the vector $\hat{y}$ that is closest to a vector $y$ in the 
column span of a matrix $H$. It does so by solving 

$$\min_{w} \|y - Hw\|^2$$

and by setting $\hat{y} = H\hat{w}$, where $\hat{w}$ is any solution to the normal equations $H^*H\hat{w} = H^*y$. 


2. The normal equations are always consistent and they may even have infinitely many solutions. 
However, regardless of which solution $\hat{w}$ we pick, the projection $\hat{y}$ is unique. 


3. Least-squares solutions are characterized by a fundamental orthogonality principle, which 
states that $\hat{w}$ is a solution if, and only if, the residual vector $\tilde{y} = y - H\hat{w}$ is orthogonal to 
$\mathcal{R}(H)$. 


4. The norms of the vectors $\{y, \hat{y}, \tilde{y}\}$ satisfy the relation $\|y\|^2 = \|\hat{y}\|^2 + \|\tilde{y}\|^2$, which is an 
extension to higher dimensions of the famed Pythagorean theorem regarding the sum of the 
squares of the sides of a right-angle triangle. 


5. The matrix that transforms $y$ into $\hat{y}$ is called the projection matrix and, when $H$ has 
full-column rank, it is given by $P_H = H(H^*H)^{-1}H^*$. 


6. Often, in practice, we resort to regularized and weighted least-squares problems, which seek 
the vector $\hat{w}$ that solves 

$$\min_{w} \left[ (w - \bar{w})^*\Pi(w - \bar{w}) + (y - Hw)^* W (y - Hw) \right]$$

where $\Pi > 0$, $W \ge 0$ and $\bar{w}$ is an initial condition (usually zero). The solution is given by 

$$\hat{w} = \bar{w} + (\Pi + H^*WH)^{-1}H^*W(y - H\bar{w})$$

The positive-definiteness of $\Pi$, and the nonnegative definiteness of $W$, guarantee that the 
coefficient matrix $(\Pi + H^*WH)$ is positive-definite and, hence, invertible. 


7. The orthogonality condition of standard least-squares problems becomes $H^*W\tilde{y} = \Pi(\hat{w} - \bar{w})$ 
in the weighted regularized case. For ease of presentation, we continue to refer to $\hat{y} = H\hat{w}$ 
as the projection of $y$ onto $\mathcal{R}(H)$ although, of course, this qualification is not accurate. 


8. Regularized and weighted least-squares solutions can be order-updated in a rather elegant 
fashion. The derivation in the text considers both forward and backward order-updates, 
whereby the data matrix $H$ is modified by adding a column to its left or to its right. The 
key conclusion is that order-updating least-squares solutions involves two steps: 


(i) We first project the additional column onto the original data matrix by solving the 
relevant regularized and weighted least-squares problem. 


(ii) We then update the original least-squares solution by using the results of this projection. 


Such order-updates will play a key role in the derivation of so-called lattice filters later in Part 
X. 


9. There is a fundamental equivalence between deterministic least-squares estimation and linear 
least-mean-squares estimation. These estimation procedures are equivalent in the sense that 
solving a problem from one class also solves a problem from the other class and vice-versa. 
The details are spelled out in Sec. 31.1. 
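Items 2 to 4 of the least-squares summary above (consistency of the normal equations, uniqueness of the projection, orthogonality, and the Pythagorean relation) can be illustrated with a deliberately rank-deficient data matrix. The construction of `H`, the null-space vector, and the scaling factor 2.5 are assumptions chosen only for this example.

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((6, 2))
H = np.column_stack([A, A[:, 0] + A[:, 1]])   # third column is dependent, so rank(H) = 2
y = rng.standard_normal(6)

w1 = np.linalg.pinv(H) @ y                    # minimum-norm least-squares solution
null_vec = np.array([1.0, 1.0, -1.0])         # H @ null_vec = 0 by construction
w2 = w1 + 2.5 * null_vec                      # a second, different solution

residual_1 = H.T @ H @ w1 - H.T @ y           # normal equations evaluated at w1
residual_2 = H.T @ H @ w2 - H.T @ y           # normal equations evaluated at w2
y_hat_1, y_hat_2 = H @ w1, H @ w2             # the two projections
y_til = y - y_hat_1                           # residual vector
```

Both weight vectors satisfy the normal equations although they differ, while the projection $\hat{y}$ they produce is the same.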


Recursive Least-Squares 


1. The recursive least-squares (RLS) algorithm is a recursive procedure that relates the weight 
vector solution at iteration $i$ to the weight vector solution at iteration $i-1$ of a regularized 
least-squares problem. 


2. Usually, RLS is used with exponential weighting in order to incorporate a forgetting mecha- 
nism into the operation of the filter; the forgetting factor gives less weight to data in the remote 
past and more weight to more recent data. 


3. In Prob. VII.33 we explain how to downdate (as opposed to update) least-squares solutions. 
Then in Prob. VII.34 we combine the update and downdate procedures to derive a sliding 
window RLS algorithm, whereby the successive least-squares solutions are based on a fixed 
length of data rather than on growing-memory data. 


4. In Prob. VII.36 we derive a block version of RLS, whereby the observation data and the 
regression data are allowed to be vectors and matrices, respectively (rather than just scalars 
and rows). The resulting filter equations are similar in structure to classical RLS with the 
conversion factor now becoming a matrix quantity. 


5. In Chapter 31 we exploit the equivalence result of Sec. 31.1 between least-squares and least- 
mean-squares estimation problems and use it to clarify the relationship that exists between 
Kalman filtering and RLS algorithms. This relationship is useful for at least two reasons: 


(i) It allows us to share ideas and algorithms between both domains. 
(ii) It allows us to develop extended variants of RLS with enhanced tracking abilities. 


BIBLIOGRAPHIC NOTES 
Least-Squares 


Least-squares and Gauss. The standard least-squares problem of Sec. 29.1 has had an interest- 
ing and controversial history since its inception in the late 1700s. It was formulated by Gauss in 1795 
at the age of 18 (see Gauss (1809)). At that time, there was interest in a claim by a philosopher 
named Hegel, who claimed to have proven, using pure logic, that there were exactly seven planets. 
Then on Jan. 1, 1801, an astronomer discovered a moving object in the constellation of Aries, and 
the location of this celestial body was observed for 41 days before suddenly dropping out of sight. 
Gauss’ contemporaries sought his help in predicting the location of the heavenly body so that they 
could ascertain whether it was a planet or a comet (see Hall (1970) for an account of this story). 
With measurements available from the earlier sightings, Gauss formulated and solved a least-squares 
problem that could predict the location of the body (which turned out to be the planetoid Ceres). Ac- 
tually, Gauss went further and formulated the recursive-least-squares solution that we shall describe 
in Chapter 30; this step helped him save the trouble of having to solve a least-squares problem afresh 
every time a new measurement became available. For some reason, Gauss did not bother to publish 
his least-squares solution, and controversy erupted in 1805 when Legendre published a book where 
he independently invented the least-squares method; see Legendre (1805,1810). Since then, the
controversy has been settled and credit is nowadays given to Gauss as the sole inventor of the method
of least-squares. 
Here is how Gauss himself motivated the least-squares problem: 


"... if several quantities depending on the same unknown have been determined by inexact 
observations, we can recover the unknown either from one of the observations or from any of 
an infinite number of combinations of the observations. Although the value of an unknown 
determined in this way is always subject to error, there will be less error in some combinations 
than in others.... One of the most important problems in the application of mathematics to 
the natural sciences is to choose the best of these many combinations, i.e., the combination 
that yields values of the unknowns that are least subject to the errors." 

Extracted from Stewart (1995, pp. 31,33). 


Gauss' choice of the "best" combination was the one that minimizes the least-squares criterion! 


Iterative reweighted least-squares algorithm. The least-squares solution (and, more specifically,
its weighted version) is useful in solving non-quadratic optimization problems of the form

$$\min_{w}\ \sum_{i=0}^{N-1} |d(i) - u_i w|^p$$

for some positive exponent $p$ (usually $1 < p < 2$). This can be seen by reformulating the criterion
as a weighted least-squares criterion in the following manner. Define the scalars (assumed nonzero):

$$\alpha(i) \triangleq |d(i) - u_i w|^{p-2}, \qquad i = 0, 1, \ldots, N-1$$

and introduce the diagonal weighting matrix $W = \mathrm{diag}\{\alpha(0), \alpha(1), \ldots, \alpha(N-1)\}$.
Then the above optimization problem can be rewritten in the form

$$\min_{w}\ (y - Hw)^* W (y - Hw)$$

where $y = \mathrm{col}\{d(0), d(1), \ldots, d(N-1)\}$ and $H = \mathrm{col}\{u_0, u_1, \ldots, u_{N-1}\}$. Of course, this
reformulation is not truly a weighted least-squares problem of the same form studied in Sec. 29.6
since $W$ depends on the unknown vector $w$. Still, this rewriting of the cost function suggests the
following iterative technique for solving it, which has proven useful in practice. Given an
estimate $w_{k-1}$ at iteration $k-1$, we do the following:


compute $\alpha_k(i) = |d(i) - u_i w_{k-1}|^{p-2}$, $i = 0, 1, \ldots, N-1$
set $W_k = \mathrm{diag}\{\alpha_k(0), \alpha_k(1), \ldots, \alpha_k(N-1)\}$
compute the new estimate as $w_k = (H^* W_k H)^{-1} H^* W_k y$
and repeat


This implementation assumes that the successive $W_k$ are invertible, and the algorithm is known as
the iterative reweighted least-squares (IRLS) algorithm. There are several variations of IRLS with
improved stability and convergence performance relative to the above implementation (see, e.g.,
Osborne (1985) and Björck (1996); see also Fletcher, Grant, and Hebden (1971) and Kahng (1972)).
One such variation is to evaluate $w_k$ not directly as above but as a convex combination in terms of
the prior iterate $w_{k-1}$ as follows:

compute $\alpha_k(i) = |d(i) - u_i w_{k-1}|^{p-2}$, $i = 0, 1, \ldots, N-1$
set $W_k = \mathrm{diag}\{\alpha_k(0), \alpha_k(1), \ldots, \alpha_k(N-1)\}$
set $\bar{w}_k = (H^* W_k H)^{-1} H^* W_k y$
set $w_k = \alpha \bar{w}_k + (1 - \alpha) w_{k-1}$
and repeat

for some scalar $0 < \alpha < 1$.
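
The recursion above can be sketched numerically as follows. This is only an illustrative sketch in NumPy, not the implementation of any of the cited references: the function name `irls`, the small floor `eps` that guards against zero residuals, the number of iterations, and the test data are all assumptions introduced here. Setting `alpha=1` gives the basic recursion, while `0 < alpha < 1` gives the convex-combination variant.

```python
import numpy as np

def irls(H, y, p=1.5, alpha=1.0, num_iters=50, eps=1e-8):
    """Sketch of IRLS for min_w sum_i |d(i) - u_i w|^p with 1 < p < 2."""
    # Start from the ordinary least-squares solution (the p = 2 case).
    w = np.linalg.lstsq(H, y, rcond=None)[0]
    for _ in range(num_iters):
        # Weights |d(i) - u_i w|^(p-2); eps is an assumed guard against zero residuals.
        residual = np.maximum(np.abs(y - H @ w), eps)
        weights = residual ** (p - 2)
        # Solve the weighted normal equations (H* W H) w_bar = H* W y.
        HW = H.conj().T * weights
        w_bar = np.linalg.solve(HW @ H, HW @ y)
        # Convex combination with the prior iterate.
        w = alpha * w_bar + (1 - alpha) * w
    return w

def lp_cost(H, y, w, p):
    """Value of the non-quadratic criterion sum_i |d(i) - u_i w|^p."""
    return np.sum(np.abs(y - H @ w) ** p)

# Hypothetical data: a linear model with a few gross outliers.
rng = np.random.default_rng(0)
H = rng.standard_normal((200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = H @ w_true + 0.1 * rng.standard_normal(200)
y[:10] += 20.0

w_ls = np.linalg.lstsq(H, y, rcond=None)[0]
w_irls = irls(H, y, p=1.2, alpha=0.8)
print(lp_cost(H, y, w_ls, 1.2), lp_cost(H, y, w_irls, 1.2))
```

For $1 < p < 2$ each reweighted solve does not increase the criterion, so the IRLS iterate should attain a cost no larger than that of the plain least-squares starting point.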


Maximum likelihood estimation. There is a fundamental connection between least-squares esti- 
mation for deterministic quantities and maximum likelihood (ML) and maximum a posteriori (MAP) 
estimation for jointly Gaussian random variables. Maximum likelihood estimation was perhaps the 
most significant contribution to estimation theory following Gauss' own formulation of the least- 
squares problem in 1795. It was introduced by Fisher (1912) more than 100 years after Gauss' work. 
In order to explain the connection between ML and least-squares, we shall briefly review the ML and 
MAP formulations. 

Let $\boldsymbol{y}$ be a random variable with a probability density function $f_{\boldsymbol{y}}(y; x)$, which is
parameterized in terms of some unknown constant parameter $x$. The maximum likelihood estimate of $x$
given a realization $y$ of $\boldsymbol{y}$ is obtained by solving (see, e.g., Van Trees (1968), Scharf (1991),
and Kay (1993)):

$$\hat{x} = \arg\max_{x}\ f_{\boldsymbol{y}}(y; x)$$

The function $f_{\boldsymbol{y}}(y; x)$ is called the likelihood function, and its logarithm is called the
log-likelihood function, $L(x, y) = \ln f_{\boldsymbol{y}}(y; x)$.

The concept of maximum likelihood estimation can be extended to the case where the parameter
is a random quantity, $\boldsymbol{x}$, and not a constant, $x$. In this case, we would maximize the joint
probability density function of $\boldsymbol{x}$ and $\boldsymbol{y}$ with $\boldsymbol{y}$ fixed at the value of its realization:

$$\hat{x} = \arg\max_{x}\ f_{\boldsymbol{x},\boldsymbol{y}}(x, y)$$

However, in view of Bayes' rule, which states that $f_{\boldsymbol{x},\boldsymbol{y}}(x, y) = f_{\boldsymbol{x}|\boldsymbol{y}}(x|y)\, f_{\boldsymbol{y}}(y)$, we see that
maximizing $f_{\boldsymbol{x},\boldsymbol{y}}(x, y)$ over $x$ is equivalent to maximizing $f_{\boldsymbol{x}|\boldsymbol{y}}(x|y)$ over $x$, i.e.,

$$\hat{x} = \arg\max_{x}\ f_{\boldsymbol{x}|\boldsymbol{y}}(x|y)$$

This latter formulation is known as maximum a posteriori (MAP) estimation, and it maximizes the
conditional pdf of $\boldsymbol{x}$ given $\boldsymbol{y}$.

Now assume that $\boldsymbol{y} = Hw + \boldsymbol{v}$, where $w$ is an unknown constant vector and where $\boldsymbol{v}$ has a
circular Gaussian distribution with zero mean and identity covariance matrix, written as
$\boldsymbol{v} \sim \mathcal{N}(0, I)$. Assume also that $\boldsymbol{y}$ is $N$-dimensional. Then $\boldsymbol{y}$ will be normally distributed
with mean $Hw$ and identity covariance matrix as well. The probability density function of $\boldsymbol{y}$ will
be given by (cf. Lemma A.1):

$$f_{\boldsymbol{y}}(y; w) = \frac{1}{\pi^N} \exp\left\{ -(y - Hw)^*(y - Hw) \right\}$$

and it is parameterized in terms of $w$. It then follows that the maximum likelihood estimate of $w$ is
the value that solves

$$\min_{w}\ \|y - Hw\|^2$$

which is the least-squares solution. In other words, the ML estimate of $w$ based on a realization of
$\boldsymbol{y} = Hw + \boldsymbol{v}$ is equal to the least-squares solution of $\min_{w} \|y - Hw\|^2$.


In a similar vein, we can provide a statistical interpretation for the regularized least-squares
problem as follows. Assume now that

$$\boldsymbol{y} = H\boldsymbol{w} + \boldsymbol{v}, \quad \text{with } \boldsymbol{v} \sim \mathcal{N}(0, W^{-1}) \text{ and } \boldsymbol{w} \sim \mathcal{N}(0, \Pi^{-1})$$

In other words, the unknown $\boldsymbol{w}$ is now modeled as a random variable. Then we already know from
the discussion in Sec. 2.2 that the optimal estimator of $\boldsymbol{w}$ given $\boldsymbol{y}$ is

$$\hat{\boldsymbol{w}} = E(\boldsymbol{w}|\boldsymbol{y}) = R_{wy} R_y^{-1} \boldsymbol{y}$$

which is also equal to the MAP estimator of $\boldsymbol{w}$ given $\boldsymbol{y}$. We further know from the calculations in
Sec. 5.1 that the expression for $\hat{\boldsymbol{w}}$ evaluates to

$$\hat{\boldsymbol{w}} = (\Pi + H^*WH)^{-1} H^*W \boldsymbol{y}$$

which, as we see, has the form of a weighted regularized least-squares solution.


Equivalence. Actually, deterministic and stochastic least-squares problems can be related more 
directly than above and without the need for the Gaussian assumption on the data and noise. In 
Sec. 31.1 we show that there is an intimate relation between regularized least-squares problems, 
as studied in this chapter, and linear least-mean-squares estimation problems, as studied in Part II 
(Linear Estimation). Although the former class of problems deals with deterministic variables while 
the latter class of problems deals with random variables, both classes turn out to be equivalent in the 
sense that solving a problem from one class also solves a problem from the other class and vice-versa. 
Among other results, such equivalence statements allow us, for example, to relate Kalman filtering 
(which deals with stochastic estimation — cf. Chapter 7) to adaptive RLS filtering (which deals with 
deterministic estimation) — see Sec. 31.2 and Sayed and Kailath (1994b). 
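
As a small numerical illustration of this kind of correspondence, the sketch below compares the linear least-mean-squares estimator $R_{wy}R_y^{-1}y$ for the model $y = Hw + v$ (with $R_w = \Pi^{-1}$ and $R_v = W^{-1}$) against the weighted regularized least-squares solution $(\Pi + H^*WH)^{-1}H^*Wy$. It is a sanity check under stated assumptions, not a proof: real-valued data are used for simplicity, and the dimensions, the helper `random_spd`, and the random test matrices are hypothetical choices made here.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 8, 3
H = rng.standard_normal((N, M))
y = rng.standard_normal(N)

def random_spd(n):
    """Hypothetical helper: a random symmetric positive-definite matrix."""
    A = rng.standard_normal((n, n))
    return A @ A.T + n * np.eye(n)

Pi = random_spd(M)   # regularization matrix; R_w = Pi^{-1}
W = random_spd(N)    # weighting matrix;      R_v = W^{-1}

# Stochastic side: w_hat = R_wy R_y^{-1} y for y = H w + v.
R_w = np.linalg.inv(Pi)
R_v = np.linalg.inv(W)
R_wy = R_w @ H.T
R_y = H @ R_w @ H.T + R_v
w_stochastic = R_wy @ np.linalg.solve(R_y, y)

# Deterministic side: weighted regularized least-squares solution.
w_deterministic = np.linalg.solve(Pi + H.T @ W @ H, H.T @ W @ y)

print(np.max(np.abs(w_stochastic - w_deterministic)))
```

The two vectors should agree to within round-off, which is the matrix inversion lemma at work.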


Reliable numerical methods. There is a huge literature on least-squares problems and on reliable 
numerical methods for their solution — see, e.g., Higham (1996), Lawson and Hanson (1995), and 
the detailed treatment by Björck (1996). Among the most reliable methods for solving least-squares
problems is the QR method, which is described in Prob. VII.12 (and which is studied further in 
Prob. VII.32 and also in Sec. 35.2 in the context of adaptive least-squares filtering). The origin of the 
QR method goes back to Householder (1953, pp. 72-73), followed by Golub (1965) and Businger 
and Golub (1965). Since then, there has been an explosion of interest in solution methods for
least-squares problems.


An application to OFDM communications. In a computer project at the end of this part we 
illustrate how least-squares (and also least-mean-squares) solutions are useful in the design of or- 
thogonal frequency-division multiplexing (OFDM) systems. OFDM is a multi-carrier modulation 
scheme (MCM) that attempts to achieve the theoretical capacity of a channel. The principle of MCM 
is to divide the channel into multiple subchannels over which data distortion is negligible, and to 
divide the data stream into parallel substreams that are used to modulate several carriers. The data 
rates of the substreams are adjusted in accordance with the spectral properties of the noise over the 
subchannels. In this way, MCM, and similarly OFDM, can achieve high data rates even over hostile 
channels. The first works on MCM systems were done by the military in the 1950s, and an early
reference on OFDM is the patent by Chang (1970), which was filed in 1966 and issued in
1970. For further details see, e.g., the books by Bahai and Saltzberg (1999), van Nee and Prasad
(2000), and Terry and Heiskala (2002). The works by Tarighat and Sayed (2003,2005) provide
further illustration of the use of linear least-squares and least-mean-squares techniques in the
design of OFDM receivers for both SISO and MIMO channels.


Recursive Least-Squares 


RLS and Gauss. Gauss was also motivated to develop the recursive least-squares algorithm in 
his work on celestial bodies (ca. 1795). Of course, Gauss' notation and derivation were reminiscent 
of late 18th century mathematics and, therefore, they do not bear much resemblance to the linear
algebraic and matrix arguments used in this chapter; see, e.g., the useful translation of Gauss' original 
work that appears in Stewart (1995). In modern times, the original work on the RLS algorithm is 
often credited to Plackett (1950,1972). 


Regularization. Regularization has a rich history in numerical analysis and matrix computation 
methods (see, e.g., Golub and Van Loan (1996)). One of the earliest references on regularization 
seems to be Tikhonov (1963). In the context of RLS filtering, an examination of the effect of regu- 
larization on RLS convergence appears in Moustakides (1997). 


Sliding window RLS. A finite-memory RLS algorithm is derived in Prob. VII.34. An efficient
implementation as the combination of two pre-windowed RLS solutions is presented in Manolakis, 
Ling, and Proakis (1987). 


Numerical issues. The exponentially-weighted RLS algorithm of Sec. 30.6 can face numerical
difficulties in finite word-length implementations. Such difficulties arise mainly from two sources
of inaccuracies: loss of Hermitian symmetry and loss of positive-definiteness of $P_i$ due to
round-off error accumulation. The loss of symmetry problem was observed in Verhaegen (1989a), and
it was suggested by Verhaegen (1989b) and Yang (1994) that this problem could be alleviated by
propagating only the lower or upper triangular parts of $P_i$. In Part VIII (Array Algorithms), we
shall describe other so-called array implementations of RLS; these array methods are more robust to
numerical difficulties and they are also more reliable than plain RLS implementations.


Filter divergence. Since the conversion factor satisfies $0 < \gamma(i) \le 1$, it can be interpreted as an
angle variable (more precisely, as the cosine of an angle) — see Lee, Morf, and Friedlander (1981). 
In this way, monitoring its value during filter operation can provide useful feedback on the proper 
operation of the filter. It was proposed by Bottomley and Alexander (1989) that monitoring of the 
conversion factor is a reasonable mechanism for detecting filter divergence — see Sec. 39.4. 


Error propagation. In the adaptive filtering literature, the numerical stability of RLS is usually 
studied by resorting to a single-error propagation model. In other words, a disturbance is assumed 
to be introduced at some iteration, with no additional disturbances occurring afterwards, and then 
the evolution of this single error is studied (an example to this effect appears later in Prob. VIII.11). 
An early study along these lines is that of Ljung and Ljung (1985), where it was shown that RLS 
is numerically reliable for $\lambda < 1$ and diverges for $\lambda = 1$. However, conclusions based on such
simplified error analyses can be misleading. More elaborate models for error propagation in RLS are 
needed, and progress in this direction appears in the work of Liavas and Regalia (1999). Actually, 
there have also been many important works in the numerical linear algebra community, most notably 
by Paige (1979ab, 1985) and Paige and Saunders (1985), on stabilized least-squares methods, which 
are relevant to the adaptive filtering context. 


RLS and Kalman filtering. There is a fundamental connection between RLS and Kalman filtering, 
so much so that solving a problem in one domain amounts to solving a problem in the other domain. 
The details of this equivalence were developed by Sayed and Kailath (1993,1994b) and they are 
discussed in Sec. 31.2. The relationship between RLS and Kalman filtering is useful for at least two 
reasons: 

1. It becomes possible to share ideas and algorithms between both domains. 


2. Since RLS is equivalent not to a full-blown Kalman filter, but only to a special case of it, it 
becomes possible to develop extended RLS schemes with enhanced tracking abilities as well 
as block RLS algorithms (see Sec. 31.A and also Haykin et al. (1997) for additional examples). 


One of the earliest mentions of a relation between least-squares and Kalman filtering seems to be 
Ho (1963); however, this reference considers only a special estimation problem where the successive 
regression vectors are identical. Later references are Sorenson (1966) and Åström and Wittenmark
(1971); these works focus only on the standard (i.e., unregularized) least-squares problem, in which
case an exact relationship between least-squares and Kalman filtering does not actually exist,
especially during the initial stages of adaptation when the least-squares problem is under-determined.
Soon afterwards, in work on channel equalization, Godard (1974) rephrased the growing-memory
(i.e., $\lambda = 1$) RLS problem in a stochastic state-space framework, with the unknown state
corresponding to the unknown weight vector. Similar constructions also appeared in Willsky (1979,
pp. 38-41), Anderson and Moore (1979, pp. 135-136), Ljung (1987, p. 309), Strobach (1990, pp. 331-335),
and Söderström (1994, pp. 145-146). In the works by Anderson and Moore (1979), Ljung (1987), and
Söderström (1994), the underlying models went a step further and incorporated the case of
exponentially decaying memory (i.e., $\lambda < 1$) by formulating state-space models with a time-variant noise
variance (as described in (31.40)). Nevertheless, annoying discrepancies persisted that precluded a 
direct correspondence between the RLS and the Kalman variables. Some of these discrepancies were 
overcome essentially by fiat (see, e.g., the treatment by Haykin (1991, pp. 502-504)). This lack of a 
direct correspondence may have inhibited application of the extensive body of Kalman filter results 
to the adaptive filtering problem until the work of Sayed and Kailath (1993,1994b). 

In retrospect, by a simple device, Sayed and Kailath (1994b) were able to obtain a perfectly 
matched state-space model for the case of exponentially decaying memory, with a direct correspondence
between the variables in the exponentially weighted RLS problem and the variables in the
state-space estimation problem. The main benefit of this result is that recursive state-space
estimation problems have been extensively studied since the early sixties. Besides the celebrated Riccati-
equation-based Kalman filtering algorithm (cf. Chapter 7), many algorithmic and implementation 
alternatives have been studied over the years (see Apps. 35.A and 37.A and, for more details, see 
the textbook by Kailath, Sayed, and Hassibi (2000)). These include the so-called information filter 
forms and for certain kinds of time-variant state-space models (including those encountered in adap- 
tive filtering), the Riccati recursions can be replaced by the order-of-magnitude faster Chandrasekhar 
recursions; moreover, all these variants have certain computationally better square-root (or array) 
forms. The interesting fact then is that when the exponentially-weighted RLS filtering problem is 
reformulated in state-space form, the Kalman filtering solutions turn out to be equivalent to the var- 
ious classes of RLS adaptive filtering algorithms that are going to be introduced in future chapters. 
The details of this equivalence are spelled out in Chapter 31, in the reference by Sayed and Kailath 
(1994b), in Apps. 35.A and 37.A, as well as in Probs. VIII.12, VIII.13, and IX.16. 


Problems and Computer Projects 


PROBLEMS 


Problem VII.1 (Rank of a matrix) Consider any N x M matrix H. Show that its row rank is 
equal to its column rank. That is, show that the number of independent columns is always equal to 
the number of independent rows (for any N and M), and hence, we can simply talk about the rank 
of a matrix. 


Problem VII.2 (Frobenius norm of a matrix) Given an $N \times M$ matrix $A$ of rank $r$, let
$A = \sum_{i=1}^{r} \sigma_i u_i v_i^*$ denote its singular value decomposition (cf. Sec. B.6). Show that

$$\|A\|_F = \sqrt{\mathrm{Tr}(A^*A)} = \sqrt{\sum_{i=1}^{r} \sigma_i^2}$$


Problem VII.3 (Projecting onto the orthogonal complement space) Consider an $N \times M$
full-rank matrix $H$ with $N > M$, and two column vectors $y$ and $z$ of dimensions $N \times 1$ each. Let
$\tilde{y} = P_H^{\perp} y$ and $\tilde{z} = P_H^{\perp} z$. Are the residual vectors $\tilde{y}$ and $\tilde{z}$ collinear in general? If your
answer is positive, justify it. If the answer is negative, can you give conditions on $N$ and $M$ for
which $\tilde{y}$ and $\tilde{z}$ will be collinear?


Problem VII.4 (Orthogonal complement space) Let $H$ be $N \times M$ with full-column rank.
Show that any vector in the column span of $P_H^{\perp}$ is orthogonal to any vector in the column span of $H$.
That is, show that $H^* P_H^{\perp} = 0$.


Problem VII.5 (Special cases) Consider the least-squares problem $\min_{w} \|y - Hw\|^2$. Comment
on the solution in the following cases: (a) $y \in \mathcal{N}(H)$, (b) $y \in \mathcal{R}(H)$, and (c) $y \in \mathcal{N}(H^*)$.


Problem VII.6 (Minimum-norm solution) Consider the under-determined least-squares problem
$\min_{w} \|y - Hw\|^2$, where $y$ is $N \times 1$, $H$ is $N \times M$, and $N < M$. Assume further that $H$ has
full rank.
(a) Verify that many solutions $\hat{w}$ exist.
(b) Show that the minimum norm solution is given by $\hat{w} = H^*(HH^*)^{-1} y$. Specifically, show
first that this $\hat{w}$ satisfies the normal equations. Then show that any other solution, say $\hat{w} + r$
for some nonzero $r$, has Euclidean norm strictly larger than $\hat{w}$.
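
The claim of part (b) can be checked numerically before attempting the proof. The following is a minimal sketch, assuming real-valued random data of hypothetical size $3 \times 6$; it constructs a second solution by adding a nullspace vector of $H$ (taken from the SVD) and compares norms. It illustrates the statement and is not a substitute for the argument requested in the problem.

```python
import numpy as np

rng = np.random.default_rng(3)
N, M = 3, 6                      # under-determined: N < M
H = rng.standard_normal((N, M))  # full row rank with probability one
y = rng.standard_normal(N)

# Candidate minimum-norm solution H^* (H H^*)^{-1} y.
w_min = H.T @ np.linalg.solve(H @ H.T, y)

# Another solution: add a nonzero vector r from the nullspace of H.
# With the full SVD, the last rows of Vt span the nullspace when N < M.
_, _, Vt = np.linalg.svd(H)
r = Vt[-1]
w_other = w_min + r

print(np.linalg.norm(y - H @ w_min), np.linalg.norm(y - H @ w_other))
print(np.linalg.norm(w_min), np.linalg.norm(w_other))
```

Both vectors reproduce $y$ exactly (up to round-off), while the one built from $H^*(HH^*)^{-1}y$ has the smaller Euclidean norm; it also coincides with the pseudo-inverse solution returned by `np.linalg.pinv`.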


Problem VII.7 (Affine projection algorithm) Refer to the discussion in Sec. 13.1. Show that
the affine projection algorithm (13.5) can be obtained as the solution to the following regularized
least-squares problem $\min_{w_i} \left[ \epsilon \|w_i - w_{i-1}\|^2 + \|d_i - U_i w_i\|^2 \right]$, for given $\{w_{i-1}, d_i, U_i\}$.


Problem VII.8 (MIMO least-squares problem) Let $\|A\|_F$ denote the Frobenius norm of $A$, i.e.,

$$\|A\|_F = \sqrt{\sum_{i=1}^{n} \sum_{j=1}^{m} |a_{ij}|^2}$$

for an $n \times m$ matrix $A$ with entries $a_{ij}$. Consider an $N \times m$ matrix $Y$, an $N \times M$ matrix $H$, and an
$M \times m$ matrix $X$. Show that $\hat{X}$ is a solution of $\min_{X} \|Y - HX\|_F$ if, and only if, it satisfies the
normal equations $H^*H\hat{X} = H^*Y$.


Problem VII.9 (Weighted least-squares) Let $W > 0$ be a given weighting matrix and let $H$
be $N \times M$.

(a) Show that the normal equations, $H^*WH\hat{w} = H^*Wy$, are consistent. More specifically,
show that $\mathcal{R}(H^*WH) = \mathcal{R}(H^*W)$.

(b) Show that $H^*WH$ is singular if, and only if, $H^*H$ is singular.

(c) Show that when the normal equations $H^*WH\hat{w} = H^*Wy$ have many solutions, regardless
of which one we pick, the projection vector $\hat{y} = H\hat{w}$ remains invariant.


Problem VII.10 (Cholesky method) Consider the normal equations $H^*H\hat{w} = H^*y$, where $H$
is $N \times M$. Assume $H$ has full column rank (i.e., the rank of $H$ is $M$) so that $H^*H$ is
positive-definite. The normal equations can be solved by the standard method of Gaussian elimination. They
can also be solved by appealing to the Cholesky factorization of $H^*H$.

We showed in Sec. B.3 that every positive-definite matrix admits a unique triangular factorization
of the form $H^*H = LL^*$, where $L$ is lower-triangular with positive entries on its diagonal. The
matrix $L$ is called the Cholesky factor of $H^*H$. Show that the normal equations can be solved by
means of the following two steps:

1. Solve the lower triangular system of equations $Lz = H^*y$ for $z$.

2. Solve the upper triangular system of equations $L^*\hat{w} = z$ for $\hat{w}$.
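
The two-step procedure can be sketched in NumPy as follows, assuming real-valued random data of hypothetical size. Note that `np.linalg.solve` is a general-purpose solver and does not exploit the triangular structure of $L$; a dedicated triangular solver would be the natural choice in practice, and the general solver is used here only to keep the sketch dependent on NumPy alone.

```python
import numpy as np

rng = np.random.default_rng(4)
H = rng.standard_normal((10, 4))   # full column rank with probability one
y = rng.standard_normal(10)

# Cholesky factor of H^* H: lower-triangular L with H^* H = L L^*.
L = np.linalg.cholesky(H.T @ H)

# Step 1: solve the lower triangular system L z = H^* y for z.
z = np.linalg.solve(L, H.T @ y)

# Step 2: solve the upper triangular system L^* w_hat = z for w_hat.
w_hat = np.linalg.solve(L.T, z)

# Reference solution from a library least-squares routine.
w_ref = np.linalg.lstsq(H, y, rcond=None)[0]
print(np.max(np.abs(w_hat - w_ref)))
```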


Problem VII.11 (Danger of squaring) Solving the normal equations $H^*H\hat{w} = H^*y$ by forming
the matrix $H^*H$ (i.e., by squaring the data) is a bad idea in general. This is because for
ill-conditioned matrices $H$, numerical precision is lost when the matrix product $H^*H$ is formed. Recall
that ill-conditioned matrices are those that have a very large ratio of largest to smallest singular
values, as defined in Sec. B.6, i.e., they are close to being rank-deficient. Consider the full-rank matrix

$$H = \begin{bmatrix} 1 & 1 \\ 0 & \epsilon \\ 1 & 1 \end{bmatrix}$$

where $\epsilon$ is a very small positive number that is of the same order of magnitude as the machine
precision. Assuming $2 + \epsilon^2 = 2$ in finite precision, what is the rank of $H^*H$?
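
The effect can be observed directly in double-precision arithmetic. The sketch below is an illustration under an assumed value $\epsilon = 10^{-9}$, chosen here so that $2 + \epsilon^2$ rounds to $2$ in IEEE double precision; it compares the numerical rank of $H$ with that of the computed product $H^*H$ using NumPy's default rank tolerance.

```python
import numpy as np

eps = 1e-9  # assumed value; eps**2 = 1e-18 is lost when added to 2 in float64
H = np.array([[1.0, 1.0],
              [0.0, eps],
              [1.0, 1.0]])

G = H.T @ H                        # the "squared" data matrix H^* H

rank_H = np.linalg.matrix_rank(H)  # numerical rank of H itself
rank_G = np.linalg.matrix_rank(G)  # numerical rank of the computed product

print(G)
print(rank_H, rank_G)
```

Working with $H$ directly preserves the information carried by $\epsilon$, whereas the computed product no longer distinguishes its two columns.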


Problem VII.12 (QR method) Consider the same setting of Prob. VII.10. A method to reduce
the effects of ill-conditioning of $H$ on the solution of the normal equations is to avoid forming the
product $H^*H$ and to determine the Cholesky factor $L$ by working directly with $H$. This can be
achieved by appealing to the so-called QR decomposition of $H$, as explained in Sec. B.5, namely,

$$H = Q \begin{bmatrix} R \\ 0 \end{bmatrix}$$

where $Q$ is $N \times N$ unitary and $R$ is $M \times M$ upper-triangular with positive diagonal entries.

(a) Show that $L = R^*$.
Remark. With the $L$ so determined, we can proceed to solve the normal equations by using the two-step
procedure of Prob. VII.10. Alternatively, we can proceed as below, which is nowadays the preferred way
of solving the normal equations due to its numerical reliability.

(b) Let $\mathrm{col}\{z_1, z_2\} = Q^*y$, where $z_1$ is $M \times 1$. Verify that $\|y - Hw\|^2 = \|z_1 - Rw\|^2 + \|z_2\|^2$.
Conclude that the least-squares solution $\hat{w}$ can be obtained by solving the triangular linear
system of equations $R\hat{w} = z_1$. Conclude further that the resulting minimum cost is $\|z_2\|^2$.
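
Part (b) can be sketched numerically as follows, assuming real-valued random data of hypothetical size. One caveat: `np.linalg.qr` does not normalize the diagonal of $R$ to be positive, so its $R$ may differ from the one in the problem statement by the signs of some rows; this does not affect the solution of $R\hat{w} = z_1$ or the minimum cost.

```python
import numpy as np

rng = np.random.default_rng(5)
N, M = 10, 4
H = rng.standard_normal((N, M))
y = rng.standard_normal(N)

# Full QR decomposition: Q is N x N, and the N x M factor equals [R; 0].
Q, R_full = np.linalg.qr(H, mode="complete")
R = R_full[:M]                 # the M x M upper-triangular block

# Rotate the observation vector and partition it as col{z1, z2}.
z = Q.T @ y
z1, z2 = z[:M], z[M:]

# Least-squares solution from the triangular system R w_hat = z1.
w_hat = np.linalg.solve(R, z1)
min_cost = np.sum(z2 ** 2)     # claimed minimum cost ||z2||^2

w_ref = np.linalg.lstsq(H, y, rcond=None)[0]
print(np.max(np.abs(w_hat - w_ref)), min_cost)
```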


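Problem VII.13 (Order-updating P via backward projection) Refer to the geometric argument
in Sec. 32.1 and, in particular, to relation (32.35) between $[H \;\; h]$ and $[H \;\; \tilde{h}]$.

(a) Use the orthogonality condition (32.36) to justify the relation

$$\underbrace{\begin{bmatrix} \Lambda^{1/2}U^* & 0 \\ 0 & \sigma^{1/2} \\ H & h \end{bmatrix}}_{A} = \underbrace{\begin{bmatrix} \Lambda^{1/2}U^* & -\Lambda^{-1/2}U^*H^*W\tilde{h} \\ 0 & \sigma^{1/2} \\ H & \tilde{h} \end{bmatrix}}_{B} \underbrace{\begin{bmatrix} I & \hat{w}^b \\ 0 & 1 \end{bmatrix}}_{C}$$

which we write as $A = BC$.
(b) Show that

$$B^* \begin{bmatrix} I & 0 \\ 0 & W \end{bmatrix} B = \begin{bmatrix} P^{-1} & 0 \\ 0 & \sigma + h^*W\tilde{h} \end{bmatrix}$$

(c) Use the equality $A^*(I \oplus W)A = C^*B^*(I \oplus W)BC$ to re-establish relation (32.20) between
$\{P, P_z\}$.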


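Problem VII.14 (Order-update of weight vector via backward projection) Refer again to
the geometric argument in Sec. 32.1 and, in particular, to relation (32.35) between $[H \;\; h]$ and $[H \;\; \tilde{h}]$.

(a) Argue that determining the vector $\hat{w}_z$ that solves (32.7) is equivalent to determining the vector
$\hat{w}_s$ that solves

$$\min_{w_s}\ \left\| \begin{bmatrix} 0 \\ 0 \\ y \end{bmatrix} - \begin{bmatrix} \Lambda^{1/2}U^* & -\Lambda^{-1/2}U^*H^*W\tilde{h} \\ 0 & \sigma^{1/2} \\ H & \tilde{h} \end{bmatrix} w_s \right\|^2_{(I \oplus W)}$$

where $\hat{w}_s = \begin{bmatrix} I & \hat{w}^b \\ 0 & 1 \end{bmatrix} \hat{w}_z$ and $\hat{w}^b$ is the solution of (32.17).

(b) Solve the problem of part (a) and re-derive (32.22).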


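Problem VII.15 (Forward projection) Refer to the discussion in Sec. 32.2 and, in particular, to
the definition of $P_z$ in (32.40). Let $P^{-1} = \Pi + H^*WH$.

(a) Establish the validity of the matrix identity:

$$\begin{bmatrix} A & B \\ C & D \end{bmatrix}^{-1} = \begin{bmatrix} 0 & 0 \\ 0 & D^{-1} \end{bmatrix} + \begin{bmatrix} I \\ -D^{-1}C \end{bmatrix} (A - BD^{-1}C)^{-1} \begin{bmatrix} I & -BD^{-1} \end{bmatrix}$$

(b) Use the identity of part (a) to show that

$$P_z = \begin{bmatrix} 0 & 0 \\ 0 & P \end{bmatrix} + \frac{1}{\sigma + \xi_h} \begin{bmatrix} 1 \\ -\hat{w}^f \end{bmatrix} \begin{bmatrix} 1 & -\hat{w}^{f*} \end{bmatrix}$$

where $\{\hat{w}^f, \xi_h\}$ are as defined by (32.41) and (32.42).

(c) Multiply both sides of the equality of part (b) from the right by $\begin{bmatrix} h^* \\ H^* \end{bmatrix} Wy$ and show that

$$\hat{w}_z = \begin{bmatrix} 0 \\ \hat{w} \end{bmatrix} + \kappa \begin{bmatrix} 1 \\ -\hat{w}^f \end{bmatrix}$$

where $\kappa = \tilde{h}^*Wy/(\sigma + \xi_h)$.


Problem VII.16 (Order-updating P via forward projection) Refer again to the discussion in
Sec. 32.2.

(a) Verify that

$$\begin{bmatrix} h & H \end{bmatrix} = \begin{bmatrix} \tilde{h} & H \end{bmatrix} \begin{bmatrix} 1 & 0 \\ \hat{w}^f & I \end{bmatrix}$$

(b) Use the orthogonality condition $H^*W\tilde{h} = \Pi\hat{w}^f$ to justify the relation

$$\underbrace{\begin{bmatrix} \sigma^{1/2} & 0 \\ 0 & \Lambda^{1/2}U^* \\ h & H \end{bmatrix}}_{A} = \underbrace{\begin{bmatrix} \sigma^{1/2} & 0 \\ -\Lambda^{-1/2}U^*H^*W\tilde{h} & \Lambda^{1/2}U^* \\ \tilde{h} & H \end{bmatrix}}_{B} \underbrace{\begin{bmatrix} 1 & 0 \\ \hat{w}^f & I \end{bmatrix}}_{C}$$

which we write as $A = BC$.
(c) As in parts (b) and (c) of Prob. VII.13, use the equality $A = BC$ to arrive at

$$P_z = \begin{bmatrix} 0 & 0 \\ 0 & P \end{bmatrix} + \frac{1}{\sigma + \xi_h} \begin{bmatrix} 1 \\ -\hat{w}^f \end{bmatrix} \begin{bmatrix} 1 & -\hat{w}^{f*} \end{bmatrix}$$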


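Problem VII.17 (Order-update of weight vector via forward projection) Refer again to the
discussion in Sec. 32.2.

(a) Argue that determining the vector $\hat{w}_z$ that solves (32.39) is equivalent to determining the
vector $\hat{w}_s$ that solves

$$\min_{w_s}\ \left\| \begin{bmatrix} 0 \\ 0 \\ y \end{bmatrix} - \begin{bmatrix} \sigma^{1/2} & 0 \\ -\Lambda^{-1/2}U^*H^*W\tilde{h} & \Lambda^{1/2}U^* \\ \tilde{h} & H \end{bmatrix} w_s \right\|^2_{(I \oplus W)}$$

(b) Solve the problem of part (a) and establish that

$$\hat{w}_z = \begin{bmatrix} 0 \\ \hat{w} \end{bmatrix} + \kappa \begin{bmatrix} 1 \\ -\hat{w}^f \end{bmatrix}$$

where $\kappa = \tilde{h}^*Wy/(\sigma + \xi_h)$.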


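Problem VII.18 (Regularized backward projection) Refer to the discussion in Sec. 32.1 and
assume that we replace the regularization matrix in (32.7) by the more general choice

$$\Pi_z = \begin{bmatrix} \Pi & m \\ m^* & \sigma \end{bmatrix}$$

for some column vector $m$ and positive scalar $\sigma$.
(a) Verify that (32.20) becomes

$$P_z = \begin{bmatrix} P & 0 \\ 0 & 0 \end{bmatrix} + \frac{1}{\zeta} \begin{bmatrix} -q \\ 1 \end{bmatrix} \begin{bmatrix} -q^* & 1 \end{bmatrix}$$

where $q = P(H^*Wh + m)$, $\zeta = \sigma + \xi_h - m^*q - \hat{w}^{b*}m$, and $\hat{w}^b$ is still the solution to the
backward projection problem (32.17). Verify further that $q = \hat{w}^b + t$, where $t = Pm$.
(b) Define $\tilde{h} \triangleq h - Hq$. Show that (32.22) is replaced by

$$\hat{w}_z = \begin{bmatrix} \hat{w} \\ 0 \end{bmatrix} + \kappa \begin{bmatrix} -q \\ 1 \end{bmatrix}$$

where $\kappa = \tilde{h}^*Wy/\zeta$.
(c) Likewise, show that $\hat{y}_z = \hat{y} + \kappa\tilde{h}$ and $\tilde{y}_z = \tilde{y} - \kappa\tilde{h}$, while $\xi_z = \xi - |\rho|^2/\zeta$ where $\rho = y^*W\tilde{h}$.
(d) Finally, let $\alpha$ denote the last entry of $\tilde{h}$. Show that $\gamma_z = \gamma - |\alpha|^2/\zeta$.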


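Problem VII.19 (Stochastic properties of least-squares solutions) Let $\boldsymbol{y} = Hw^o + \boldsymbol{v}$,
where $w^o$ is an unknown vector that we wish to estimate and $\boldsymbol{v}$ is a zero-mean random vector with
covariance matrix $E\boldsymbol{v}\boldsymbol{v}^* = \sigma_v^2 I$. Moreover, $H$ is $N \times M$, $N \ge M$, and has full rank. If $y$ is a
realization of $\boldsymbol{y}$, then the least-squares estimate of $w^o$ given $y$ is $\hat{w} = (H^*H)^{-1}H^*y$. The resulting
residual vector is $\tilde{y} = y - H\hat{w}$.

In order to study the stochastic properties of this least-squares solution, we need to treat it as an
estimator and write instead $\hat{\boldsymbol{w}} = (H^*H)^{-1}H^*\boldsymbol{y}$, in terms of the random quantity $\boldsymbol{y}$. We also write
$\tilde{\boldsymbol{y}} = \boldsymbol{y} - H\hat{\boldsymbol{w}}$; different realizations for $\boldsymbol{y}$ lead to different realizations for $\{\hat{\boldsymbol{w}}, \tilde{\boldsymbol{y}}\}$.

(a) Verify that $\hat{\boldsymbol{w}} = w^o + (H^*H)^{-1}H^*\boldsymbol{v}$, and conclude that $\hat{\boldsymbol{w}}$ is an unbiased estimator, i.e.,
$E\hat{\boldsymbol{w}} = w^o$.

(b) Show that $E(w^o - \hat{\boldsymbol{w}})(w^o - \hat{\boldsymbol{w}})^* = \sigma_v^2 (H^*H)^{-1}$.

(c) Assume $N > M$ and let $\hat{\sigma}_v^2 = \|\tilde{\boldsymbol{y}}\|^2/(N - M)$ denote an estimator for the noise variance.
Verify that $E\|\tilde{\boldsymbol{y}}\|^2 = \sigma_v^2\, \mathrm{Tr}(I - P_H)$, where $P_H$ is the projection matrix onto $\mathcal{R}(H)$. Show
further that $\mathrm{Tr}(I - P_H) = N - M$ and conclude that $\hat{\sigma}_v^2$ is an unbiased estimator for $\sigma_v^2$.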
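
The statements of parts (a) and (c) lend themselves to a quick Monte Carlo check. The sketch below is an illustration only: it assumes real-valued Gaussian noise (the problem itself requires only zero mean and covariance $\sigma_v^2 I$), and the dimensions, noise level, and number of trials are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(6)
N, M, sigma_v = 20, 4, 0.5
H = rng.standard_normal((N, M))
w_o = rng.standard_normal(M)

# Projection onto the range space of H, and the trace of its complement.
P_H = H @ np.linalg.solve(H.T @ H, H.T)
trace_complement = np.trace(np.eye(N) - P_H)

# Many independent realizations of y = H w_o + v, one per column.
trials = 20000
V = sigma_v * rng.standard_normal((N, trials))
Y = (H @ w_o)[:, None] + V

W_hat = np.linalg.solve(H.T @ H, H.T @ Y)   # least-squares estimates (M x trials)
Y_tilde = Y - H @ W_hat                     # residual vectors

mean_w_hat = W_hat.mean(axis=1)                               # approximates E w_hat
mean_var_hat = (np.sum(Y_tilde ** 2, axis=0) / (N - M)).mean()  # approximates E sigma_hat^2

print(trace_complement, N - M)
print(np.max(np.abs(mean_w_hat - w_o)), mean_var_hat, sigma_v ** 2)
```

The sample averages should approach $w^o$ and $\sigma_v^2$ as the number of trials grows, consistent with unbiasedness.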


Problem VII.20 (Constrained least-squares) In constrained least-squares problems we want
to minimize $\|y - Hw\|^2$ over $w$ subject to the linear constraint $Aw = b$, where $H$ and $A$ are
$N \times n$ ($N \ge n$) and $M \times n$ ($M < n$) full-rank matrices, respectively. Show that the solution
is given by $\hat{w}_c = \hat{w} - (H^*H)^{-1}A^*\left[A(H^*H)^{-1}A^*\right]^{-1}(A\hat{w} - b)$, where $\hat{w}$ is the standard
least-squares solution, $\hat{w} = (H^*H)^{-1}H^*y$.
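
The closed-form expression can be sanity-checked numerically. The following sketch assumes real-valued random data of hypothetical size; it verifies that the candidate satisfies the constraint and that moving away from it along a feasible direction (a nullspace vector of $A$, taken from the SVD) does not reduce the cost. This is a check of the formula, not the derivation asked for in the problem.

```python
import numpy as np

rng = np.random.default_rng(7)
N, n, M = 12, 5, 2
H = rng.standard_normal((N, n))
A = rng.standard_normal((M, n))
y = rng.standard_normal(N)
b = rng.standard_normal(M)

G_inv = np.linalg.inv(H.T @ H)
w_hat = G_inv @ H.T @ y                      # unconstrained least-squares solution

# Constrained solution: correct w_hat so that A w = b holds.
correction = np.linalg.solve(A @ G_inv @ A.T, A @ w_hat - b)
w_c = w_hat - G_inv @ A.T @ correction

def cost(w):
    """Least-squares cost ||y - H w||^2."""
    return np.sum((y - H @ w) ** 2)

# A feasible direction d with A d = 0 (last right singular vector of A).
_, _, Vt = np.linalg.svd(A)
d = Vt[-1]

print(np.linalg.norm(A @ w_c - b))
print(cost(w_c), cost(w_c + 0.1 * d), cost(w_c - 0.1 * d))
```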


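Problem VII.21 (QR solution of constrained least-squares) Refer to Prob. VII.20 and
introduce the QR decomposition of $A^*$, namely,

$$A^* = Q \begin{bmatrix} R \\ 0 \end{bmatrix}$$

where $Q$ is $n \times n$ unitary and $R$ is $M \times M$ upper triangular. Introduce the change of variables
$z = Q^*w$ and $\bar{H} = HQ$ and partition $\{z, \bar{H}\}$ into

$$z = \begin{bmatrix} z_1 \\ z_2 \end{bmatrix}, \qquad \bar{H} = \begin{bmatrix} \bar{H}_1 & \bar{H}_2 \end{bmatrix}$$

where $z_1$ is $M \times 1$ and $\bar{H}_1$ is $N \times M$. Let $\bar{y} = y - \bar{H}_1\hat{z}_1$.
(a) Show that the constrained least-squares formulation of Prob. VII.20 is equivalent to

$$\min_{z}\ \|y - \bar{H}z\|^2 \quad \text{subject to} \quad R^*z_1 = b$$

(b) Show that the minimizing solution is given by $\hat{z} = \mathrm{col}\{\hat{z}_1, \hat{z}_2\}$, where $\hat{z}_1 = R^{-*}b$ while $\hat{z}_2$
is determined by solving the least-squares problem $\min_{z_2} \|\bar{y} - \bar{H}_2 z_2\|^2$.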
z2 


Problem VII.22 (Gauss-Markov theorem) Recall the statement of Thm. 6.1. Given a random
vector $\boldsymbol{y}$ that is linearly related to an unknown vector $w$ via $\boldsymbol{y} = Hw + \boldsymbol{v}$, where $\boldsymbol{v}$ is zero-mean
noise with covariance matrix $R_v$, the minimum-variance unbiased linear estimator of $w$ is
$\hat{\boldsymbol{w}} = (H^*R_v^{-1}H)^{-1}H^*R_v^{-1}\boldsymbol{y}$ or, equivalently, the estimate of $w$ given an observation $y$ is
$\hat{w} = (H^*R_v^{-1}H)^{-1}H^*R_v^{-1}y$. Verify that this expression can be interpreted as the solution to the
weighted least-squares problem $\min_{w}\ (y - Hw)^*R_v^{-1}(y - Hw)$.

Remark. It follows that the Gauss-Markov theorem suggests that an optimal choice for the weighting matrix $W$
in (29.21) is $W = R_v^{-1}$, i.e., the inverse of the covariance matrix of the noise component in $\boldsymbol{y}$.


Problem VII.23 (Separating signal from perturbation) Consider the linear model
$y = Hx + S\theta + v$, where $v$ is a zero-mean additive noise with unit covariance matrix, and $(x, \theta)$ are
unknown constant vectors. The matrices $H$ and $S$ are $N \times n$ and $N \times m$, respectively, and such that
$[H \;\; S]$ has full rank with $N \ge m + n$. The term $S\theta$ can be interpreted as a structured perturbation that
lies in the column span of $S$, while $s = Hx$ denotes the desired signal that is corrupted by $S\theta$ and
by $v$. We wish to estimate $Hx$ from $y$ and, hence, separate the signal component, $Hx$, from the
perturbation, $S\theta$.

(a) Let $z = \mathrm{col}\{x, \theta\}$. Determine the minimum variance unbiased estimator, $\hat{z}$, of $z$ given $y$.
Hint: Recall Thm. 6.1.

(b) Let $\hat{z} = \mathrm{col}\{\hat{x}, \hat{\theta}\}$ and consider the estimator $\hat{s} = H\hat{x}$ for $s$. Show that

$$\hat{s} = P_H \left[ I - S(S^*P_H^{\perp}S)^{-1}S^*P_H^{\perp} \right] y$$

with $P_H^{\perp} = I - P_H$, $P_S^{\perp} = I - P_S$ and where $P_S$ and $P_H$ denote the orthogonal projectors
onto the column spans of $S$ and $H$, respectively, i.e., $P_S = S(S^*S)^{-1}S^*$ and
$P_H = H(H^*H)^{-1}H^*$.

(c) Assume instead that $x$ is modeled as a zero-mean random variable with covariance matrix
$\Pi^{-1} = Exx^* > 0$. Show that the linear least-mean-squares estimator of $s = Hx$ is now
given by $\hat{s} = \bar{P}_H \left[ I - S(S^*\bar{P}_H^{\perp}S)^{-1}S^*\bar{P}_H^{\perp} \right] y$, where $\bar{P}_H^{\perp} = I - \bar{P}_H$ and
$\bar{P}_H = H(\Pi + H^*H)^{-1}H^*$. Compute the resulting minimum mean-square error (m.m.s.e.) and
compare it with that of part (b).
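
The expression in part (b) can be checked numerically against a direct least-squares fit over $z = \mathrm{col}\{x, \theta\}$, which is the estimator that Thm. 6.1 yields when the noise has unit covariance. The sketch below assumes real-valued random data of hypothetical size and is meant only as a sanity check of the formula.

```python
import numpy as np

rng = np.random.default_rng(8)
N, n, m = 12, 3, 2
H = rng.standard_normal((N, n))
S = rng.standard_normal((N, m))
y = rng.standard_normal(N)

# Joint least-squares estimate of z = col{x, theta} using the stacked matrix [H S].
z_hat = np.linalg.lstsq(np.hstack([H, S]), y, rcond=None)[0]
s_direct = H @ z_hat[:n]                     # s_hat = H x_hat

# Projector-based expression from part (b).
P_H = H @ np.linalg.solve(H.T @ H, H.T)
P_H_perp = np.eye(N) - P_H
theta_term = np.linalg.solve(S.T @ P_H_perp @ S, S.T @ P_H_perp @ y)
s_formula = P_H @ (y - S @ theta_term)

print(np.max(np.abs(s_direct - s_formula)))
```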


Problem VII.24 (Property of regularized solutions) Let $\hat{w}$ denote the unique solution of the
regularized least-squares problem

$$\min_{w}\ \left[ \gamma\|w\|^2 + \|y - Hw\|^2 \right]$$

where $H$ is $N \times n$ and has full rank, with $N > n$, and $y$ does not lie in the column span of $H$.
Assume further that $H^*y \neq 0$ so that $\hat{w}$ is nonzero. Moreover, $\gamma$ is a finite positive number. Let
$\eta^2 = \gamma^2\|\hat{w}\|^2/\|y - H\hat{w}\|^2$. Show that $\eta^2 < \|H^*y\|^2/\|y\|^2$.


Remark. This result can be used to show that, under the given conditions on the data (H, y), the nonzero solution 
of a regularized least-squares problem is also the solution to a robust estimation problem of the form studied in 
Probs. VII.25-VII.26. For more details on such robust formulations, see Sayed, Nascimento, and Chandrasekaran
(1998). Further extensions and applications appear in Sayed and Chandrasekaran (2000), Sayed (2001), Sayed, 
Nascimento, and Cipparrone (2002), and Sayed and Chen (2002). 


Problem VII.25 (Zero solution of a robust problem) Consider an $N \times n$ full rank matrix $H$
with $N \ge n$, and an $N \times 1$ vector $y$ that does not belong to the column span of $H$. Let $\eta$ be a positive
real number and consider the set of all matrices $\delta H$ whose norm does not exceed $\eta$, $\|\delta H\| \le \eta$. Here,
by the norm of a matrix we mean its maximum singular value (see Sec. B.6). For the purposes of this
problem, all you need to know about this norm is that it satisfies the property $\|Hx\| \le \|H\|\,\|x\|$ for
any vector $x$, where $\|x\|$ denotes the Euclidean norm of $x$. Now consider the following optimization
problem

$$\min_{w}\ \max_{\|\delta H\| \le \eta}\ \|y - (H + \delta H)w\|$$

That is, we seek a vector $w$ that minimizes the maximum residual over the set $\{\|\delta H\| \le \eta\}$.

(a) Argue from the conditions of the problem that we must have $N > n$.

(b) Show that the uncertainty set $\{\|\delta H\| \le \eta\}$ contains a perturbation $\delta H^o$ such that $y$ is
orthogonal to $(H + \delta H^o)$ if, and only if, $\eta \ge \|H^*y\|/\|y\|$.

(c) Show that the above optimization problem has a unique solution at $\hat{w} = 0$ if, and only if, the
condition on $\eta$ in part (b) holds.


Problem VII.26 (Nonzero solution of a robust problem) Consider an N x n full rank matrix 
H with N > n, and an N x 1 vector y that does not belong to the column span of H. 


(a) For any nonzero n x 1 column vector w, show that the following rank-one modification of H
(denoted by H(w)),

$$H(w) \triangleq H + \eta\, \frac{(Hw - y)\, w^*}{\|Hw - y\|\, \|w\|}$$

still has full rank for any positive real number $\eta$.

(b) Verify that $\|y - H(w)w\| = \|y - Hw\| + \eta\|w\|$, and that the vectors $y - H(w)w$ and $y - Hw$
are collinear and point in the same direction (that is, one is a positive multiple of the other).

(c) Show that $\|y - H(w)w\| = \max_{\|\delta H\| \leq \eta} \|y - (H + \delta H)w\|$.

(d) Show that the optimization problem $\min_w \max_{\|\delta H\| \leq \eta} \|y - (H + \delta H)w\|$ has a nonzero
solution $\hat{w}$ if, and only if, $\eta < \|H^*y\|/\|y\|$.

(e) Show that $\hat{w}$ is a nonzero solution of the optimization problem in part (d) if, and only if,
$H^*(\hat{w})[y - H\hat{w}] = 0$. That is, the residual vector $y - H\hat{w}$ should be orthogonal to the
perturbed matrix $H(\hat{w})$. Show further that this condition is equivalent to
$H^*(\hat{w})[y - H(\hat{w})\hat{w}] = 0$.

(f) Assume two nonzero solutions $\hat{w}_1$ and $\hat{w}_2$ exist that satisfy the orthogonality condition of part
(e). Argue that $H^*(\hat{w}_2)[y - H(\hat{w}_2)\hat{w}_1] = 0$, and conclude that $\hat{w}_1 = \hat{w}_2$ so that the solution
is unique.


Problem VII.27 (Circulant matrices and DFT transformations) Refer to the discussion on 
OFDM receivers in Computer Project VII.1. Use the fact that the DFT matrix F is symmetric 
to verify that $\Lambda = F^* H^{\mathsf{T}} F$. Conclude from this relation that (VII.2) should hold.


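Problem VII.28 (Inner product equality) Consider a data matrix H and partition it as
$H = [\, y \;\; \bar{H} \;\; z \,]$, with y and z denoting its leading and trailing columns, respectively. Let $\hat{y}$ and $\hat{z}$
denote the regularized least-squares estimates of y and z given $\bar{H}$, namely, $\hat{y} = \bar{H}\hat{x}_y$, $\hat{z} = \bar{H}\hat{x}_z$,
$\tilde{y} = y - \hat{y}$, and $\tilde{z} = z - \hat{z}$, where $\hat{x}_y$ and $\hat{x}_z$ are the solutions of

$$\min_{x_y} \left( x_y^* \Pi x_y + \|y - \bar{H}x_y\|^2 \right) \quad \text{and} \quad \min_{x_z} \left( x_z^* \Pi x_z + \|z - \bar{H}x_z\|^2 \right)$$

Show that $\tilde{y}^* z = y^* \tilde{z}$. Define $\kappa \triangleq \tilde{y}^*\tilde{z}/(\|\tilde{y}\|\,\|\tilde{z}\|)$. Show that $|\kappa| \leq 1$.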


Problem VII.29 (Minimum cost) Refer to the derivation in Sec. 30.5 for the time-update of the 
minimum cost. Assume instead that we start from a regularized cost function of the form: 


$$\min_w \left[ (w - \bar{w})^* \Pi (w - \bar{w}) + \|y_N - H_N w\|^2 \right]$$

with a nonzero $\bar{w}$. Its minimum cost is given by $\xi(N) = (y_N - H_N\bar{w})^*(y_N - H_N w_N)$ (see Table
10.3). Repeat the derivation of that section to show that $\xi(N)$ still satisfies the recursion
$\xi(N) = \xi(N-1) + e(N)r^*(N)$, with $\xi(-1) = 0$ and where $e(N) = d(N) - u_N w_{N-1}$ and
$r(N) = d(N) - u_N w_N$.


Problem VII.30 (Modified RLS) Refer again to the discussion in Sec. 30.6 on the exponentially 
weighted least-squares problem but assume now that $y_N$ evolves in time in the following manner:

$$y_N = \begin{bmatrix} a\, y_{N-1} \\ d(N) \end{bmatrix}$$

for some scalar a. The choice a = 1 reduces to the situation studied in Sec. 30.6. Repeat the
arguments prior to the statement of Alg. 30.1 and show that the solution $w_N$, and the corresponding
minimum cost, $\xi(N)$, can be computed recursively as follows. Start with $w_{-1} = 0$, $P_{-1} = \Pi^{-1}$,
and $\xi(-1) = 0$, and iterate for $i \geq 0$:

$$\begin{aligned}
\gamma(i) &= 1/(1 + \lambda^{-1} u_i P_{i-1} u_i^*) \\
g_i &= \lambda^{-1} P_{i-1} u_i^* \gamma(i) \\
e(i) &= d(i) - a\, u_i w_{i-1} \\
w_i &= a\, w_{i-1} + g_i e(i) \\
P_i &= \lambda^{-1} P_{i-1} - g_i g_i^*/\gamma(i) \\
\xi(i) &= \lambda |a|^2 \xi(i-1) + r(i) e^*(i)
\end{aligned}$$

In particular, observe that the scalar a appears in the expressions for $\{w_i, e(i), \xi(i)\}$. Show further
that $r(i) = \gamma(i)e(i)$ where $r(i) = d(i) - u_i w_i$.
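
The recursion above can be checked numerically against a direct (batch) solution. The following is only a minimal sketch under stated assumptions: the function names `modified_rls` and `batch_solution` are hypothetical, the scalar a is taken to be real, and the batch reference assumes the exponentially weighted cost of Sec. 30.6, with regularization term $\lambda^{N+1} w^*\Pi w$ and with $y_N$ built by scaling $d(j)$ by $a^{N-j}$.

```python
import numpy as np

def modified_rls(U, d, a, lam, Pi):
    """Run the recursion of Prob. VII.30 over regressors U[i] and samples d[i]."""
    M = U.shape[1]
    w = np.zeros(M, dtype=complex)
    P = np.linalg.inv(Pi).astype(complex)
    xi = 0.0
    for u, di in zip(U, d):
        Pu = P @ u.conj()                          # P_{i-1} u_i^*
        gamma = 1.0 / (1.0 + (u @ Pu).real / lam)  # conversion factor gamma(i)
        g = Pu * gamma / lam                       # gain vector g_i
        e = di - a * (u @ w)                       # a priori error e(i)
        w = a * w + g * e
        P = P / lam - np.outer(g, g.conj()) / gamma
        r = di - u @ w                             # a posteriori error r(i)
        xi = lam * abs(a) ** 2 * xi + (r * np.conj(e)).real
    return w, xi

def batch_solution(U, d, a, lam, Pi):
    """Direct solution of the weighted problem (assumed form, see lead-in)."""
    N = len(d) - 1
    y = np.array([a ** (N - j) * d[j] for j in range(N + 1)])
    Lam = np.diag([lam ** (N - j) for j in range(N + 1)])
    Phi = lam ** (N + 1) * Pi + U.conj().T @ Lam @ U
    w = np.linalg.solve(Phi, U.conj().T @ Lam @ y)
    res = y - U @ w
    cost = lam ** (N + 1) * (w.conj() @ Pi @ w) + res.conj() @ Lam @ res
    return w, cost.real

rng = np.random.default_rng(0)
U = rng.standard_normal((12, 3)) + 1j * rng.standard_normal((12, 3))
d = rng.standard_normal(12) + 1j * rng.standard_normal(12)
Pi = 0.5 * np.eye(3)
w_rec, xi_rec = modified_rls(U, d, 0.9, 0.95, Pi)
w_bat, xi_bat = batch_solution(U, d, 0.9, 0.95, Pi)
```

Under these assumptions, the recursive and batch results should agree to within round-off, both for the weight vector and for the minimum cost.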


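Problem VII.31 (Inner-product update) Refer to the discussion in Sec. 32.3 and assume the
extended matrix $\bar{H}_N$ in (32.50) is defined instead as follows

$$\bar{H}_N \triangleq \begin{bmatrix} a\, x_{N-1} & H_{N-1} & z_{N-1} \\ \alpha(N) & h_N & \beta(N) \end{bmatrix} \triangleq \begin{bmatrix} x_N & H_N & z_N \end{bmatrix}$$

with a scalar a multiplying $x_{N-1}$. Repeat the derivation of that section to show that now

$$\Delta(N) = \lambda a^* \Delta(N-1) + \tilde{\alpha}_a^*(N)\tilde{\beta}(N)/\bar{\gamma}(N)$$

where $\{\bar{\gamma}(N), \tilde{\beta}(N)\}$ are defined as before, while $\tilde{\alpha}_a(N) = \alpha(N) - a\, h_N w^x_{N-1}$.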


Problem VII.32 (QR method for updating LS solutions) Refer to the discussion and notation 
in Sec. 30.2, and consider the two standard least-squares problems: 


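$$\min_w \|y_{N-1} - H_{N-1}w\|^2 \quad \text{and} \quad \min_w \left\| \begin{bmatrix} y_{N-1} \\ d(N) \end{bmatrix} - \begin{bmatrix} H_{N-1} \\ u_N \end{bmatrix} w \right\|^2$$

whose solutions we denote by $w_{N-1}$ and $w_N$, respectively. Assume the data matrices $\{H_{N-1}, H_N\}$
have full column rank so that $\{w_{N-1}, w_N\}$ are uniquely defined. Suppose that at step N - 1 we
have solved the first problem via the QR method of Prob. VII.12 and obtained $w_{N-1}$. Let

$$H_{N-1} = Q_{N-1} \begin{bmatrix} R_{N-1} \\ 0 \end{bmatrix}$$

denote the QR factorization of $H_{N-1}$, and define $Q_{N-1}^* y_{N-1} = \mathrm{col}\{z_{1,N-1}, z_{2,N-1}\}$, where
$z_{1,N-1}$ is M x 1. Let also

$$\begin{bmatrix} R_{N-1} \\ u_N \end{bmatrix} = \Theta \begin{bmatrix} R_N \\ 0 \end{bmatrix}$$

denote the QR factorization of $\mathrm{col}\{R_{N-1}, u_N\}$.

(a) Verify that

$$\begin{bmatrix} Q_{N-1} & 0 \\ 0 & 1 \end{bmatrix}^* \begin{bmatrix} y_N & H_N \end{bmatrix} = \begin{bmatrix} z_{1,N-1} & R_{N-1} \\ z_{2,N-1} & 0 \\ d(N) & u_N \end{bmatrix}$$

(b) Define the entries $\{z_{1,N}, z_{2,N}\}$ as follows:

$$\begin{bmatrix} z_{1,N} \\ z_{2,N} \end{bmatrix} \triangleq \Theta^* \begin{bmatrix} z_{1,N-1} \\ d(N) \end{bmatrix}$$

Show that $w_N = R_N^{-1} z_{1,N}$. Show also that the minimum cost $\xi(N)$ at step N satisfies

$$\xi(N) = \xi(N-1) + |z_{2,N}|^2$$

Remark. In summary, we find that $\{R_N, z_{1,N}, z_{2,N}\}$ are updated via the equation

$$\Theta^* \begin{bmatrix} R_{N-1} & z_{1,N-1} \\ u_N & d(N) \end{bmatrix} = \begin{bmatrix} R_N & z_{1,N} \\ 0 & z_{2,N} \end{bmatrix}$$

We shall re-derive this QR method in Sec. 35.2 via a different route and, actually, for more general regularized
least-squares problems. At that time, we shall also provide further insight into the meaning of the variables
$\{z_{1,N}, z_{2,N}\}$.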


Problem VII.33 (Downdating least-squares solutions) We use the notation of Sec. 30.2. In 
this problem, we consider a new issue, namely, that of removing the effect of earlier data. Thus 
consider again the data $\{y_N, H_N\}$ from (30.5) and partition them instead as follows:

$$y_N = \begin{bmatrix} d(0) \\ y_{1:N} \end{bmatrix}, \qquad H_N = \begin{bmatrix} u_0 \\ H_{1:N} \end{bmatrix}$$

where the notation $y_{1:N}$ denotes the last entries of $y_N$ (i.e., from time 1 to time N). Likewise, $H_{1:N}$
denotes the last rows of $H_N$. The subscript notation 1:N indicates that $H_{1:N}$ consists of rows 1
through N; likewise for $y_{1:N}$.

Suppose now that we wish to remove the effect of the initial data $\{d(0), u_0\}$, i.e., we are interested
in the solution of the regularized least-squares problem:

$$\min_w \left[ w^*\Pi w + \|y_{1:N} - H_{1:N}w\|^2 \right]$$

We denote its solution by $w_{1:N}$, i.e., $w_{1:N} = (\Pi + H_{1:N}^* H_{1:N})^{-1} H_{1:N}^* y_{1:N}$ (cf. (30.7)). The
purpose of this problem is to relate $w_{1:N}$ to $w_N$ in (30.7), which solves (30.6). Let
$P_{1:N} = (\Pi + H_{1:N}^* H_{1:N})^{-1}$, and recall from (30.8) that $P_N = (\Pi + H_N^* H_N)^{-1}$. The arguments that follow, for
relating $\{w_{1:N}, w_N\}$ and $\{P_{1:N}, P_N\}$, are essentially identical to those in Sec. 30.2 while deriving
the RLS algorithm.

(a) Show that $P_{1:N} = P_N - P_N u_0^* u_0 P_N/(-1 + u_0 P_N u_0^*)$.

(b) Show also that $w_{1:N} = w_N + \dfrac{P_N u_0^*}{-1 + u_0 P_N u_0^*}\,[d(0) - u_0 w_N]$.

(c) Show that $1 - u_0 P_N u_0^* > 0$.


Problem VII.34 (Sliding window RLS) We can combine the result of Prob. VII.33 with the
recursions of Alg. 30.1 to derive a recursive least-squares solution with finite memory. The solution
$w_N$ in (30.7) is usually termed a growing-memory solution since it depends on all data prior to and
including time N. A finite-memory solution can be developed as follows (see also Prob. IX.14).
Introduce the quantities

$$y_{N-L+1:N} \triangleq \begin{bmatrix} d(N-L+1) \\ \vdots \\ d(N) \end{bmatrix}, \qquad H_{N-L+1:N} \triangleq \begin{bmatrix} u_{N-L+1} \\ \vdots \\ u_N \end{bmatrix}$$

The vector $y_{N-L+1:N}$ contains entries over an interval of length L, namely, from time N - L + 1
up to time N. Likewise, the data matrix $H_{N-L+1:N}$ contains regressors over the same interval of
time. Let $w_{N-L+1:N}$ denote the solution to the regularized least-squares problem:

$$\min_w \left[ w^*\Pi w + \|y_{N-L+1:N} - H_{N-L+1:N}w\|^2 \right]$$

The purpose of this problem is to derive an algorithm for going from $w_{N-L:N-1}$ to $w_{N-L+1:N}$, as
depicted in Fig. VII.1. This can be achieved by combining the downdating recursion
$w_{N-L:N-1} \rightarrow w_{N-L+1:N-1}$ (which has a similar form to that derived in Prob. VII.33) with the updating recursion
$w_{N-L+1:N-1} \rightarrow w_{N-L+1:N}$ (which has a similar form to that in Alg. 30.1), i.e.,

$$w_{N-L:N-1} \;\xrightarrow{\text{downdating}}\; w_{N-L+1:N-1} \;\xrightarrow{\text{updating}}\; w_{N-L+1:N}$$

FIGURE VII.1 A procedure for finite-memory recursive least-squares by means of updating and downdating
least-squares solutions.

Introduce

$$P_{N-L:N-1} \triangleq (\Pi + H_{N-L:N-1}^* H_{N-L:N-1})^{-1}$$

(a) Argue as in Prob. VII.33 to show that

$$\begin{aligned}
w_{N-L+1:N-1} &= w_{N-L:N-1} + \frac{P_{N-L:N-1}u_{N-L}^*}{-1 + u_{N-L}P_{N-L:N-1}u_{N-L}^*}\, e_d(N-L) \\
e_d(N-L) &= d(N-L) - u_{N-L}w_{N-L:N-1} \\
P_{N-L+1:N-1} &= P_{N-L:N-1} - \frac{P_{N-L:N-1}u_{N-L}^* u_{N-L}P_{N-L:N-1}}{-1 + u_{N-L}P_{N-L:N-1}u_{N-L}^*}
\end{aligned}$$

(b) Argue as in the steps that led to Alg. 30.1 to show that

$$\begin{aligned}
w_{N-L+1:N} &= w_{N-L+1:N-1} + \frac{P_{N-L+1:N-1}u_N^*}{1 + u_N P_{N-L+1:N-1}u_N^*}\, e_u(N) \\
e_u(N) &= d(N) - u_N w_{N-L+1:N-1} \\
P_{N-L+1:N} &= P_{N-L+1:N-1} - \frac{P_{N-L+1:N-1}u_N^* u_N P_{N-L+1:N-1}}{1 + u_N P_{N-L+1:N-1}u_N^*}
\end{aligned}$$

(c) For simplicity of notation, denote the vectors $\{w_{N-L:N-1}, w_{N-L+1:N}\}$ by $\{w^L_{N-1}, w^L_N\}$;
i.e., they are solutions based on windows of length L and they become available at times
{N - 1, N}. Likewise, let $\{w^{L-1}_{i-1}, P^{L-1}_{i-1}\}$ denote the intermediate (downdated) quantities, which
are based on a window of length L - 1. Using this notation, conclude that the solution $w^L_N$ can be
computed recursively as follows. Start with $w^L_{-1} = 0$ and $P^L_{-1} = \Pi^{-1}$ and repeat for $i \geq 0$:

$$\begin{aligned}
e_d(i-L) &= d(i-L) - u_{i-L}w^L_{i-1} \\
\gamma_d(i-L) &= 1/(1 - u_{i-L}P^L_{i-1}u_{i-L}^*) \\
g_{d,i} &= P^L_{i-1}u_{i-L}^*\gamma_d(i-L) \\
w^{L-1}_{i-1} &= w^L_{i-1} - g_{d,i}\,e_d(i-L) \\
P^{L-1}_{i-1} &= P^L_{i-1} + g_{d,i}\,g_{d,i}^*/\gamma_d(i-L) \\
e_u(i) &= d(i) - u_i w^{L-1}_{i-1} \\
\gamma_u(i) &= 1/(1 + u_i P^{L-1}_{i-1}u_i^*) \\
g_{u,i} &= P^{L-1}_{i-1}u_i^*\gamma_u(i) \\
w^L_i &= w^{L-1}_{i-1} + g_{u,i}\,e_u(i) \\
P^L_i &= P^{L-1}_{i-1} - g_{u,i}\,g_{u,i}^*/\gamma_u(i)
\end{aligned}$$
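
The downdate-then-update structure of part (c) can be sketched in a few lines. This is an illustrative sketch only, under assumptions that are not part of the problem statement: real-valued data, a hypothetical function name `sliding_window_rls`, and the convention that the downdating step is simply skipped while fewer than L samples have been processed.

```python
import numpy as np

def sliding_window_rls(U, d, L, Pi):
    """Finite-memory RLS: downdate the sample leaving the window, then update."""
    M = U.shape[1]
    w = np.zeros(M)
    P = np.linalg.inv(Pi)
    for i in range(len(d)):
        if i - L >= 0:                        # sample i-L leaves the window
            u_old = U[i - L]
            e_d = d[i - L] - u_old @ w
            gamma_d = 1.0 / (1.0 - u_old @ P @ u_old)
            g_d = P @ u_old * gamma_d
            w = w - g_d * e_d                 # downdated weight (window L-1)
            P = P + np.outer(g_d, g_d) / gamma_d
        u = U[i]
        e_u = d[i] - u @ w
        gamma_u = 1.0 / (1.0 + u @ P @ u)
        g_u = P @ u * gamma_u
        w = w + g_u * e_u                     # updated weight (window L)
        P = P - np.outer(g_u, g_u) / gamma_u
    return w

rng = np.random.default_rng(1)
U = rng.standard_normal((30, 3))
d = rng.standard_normal(30)
Pi, L = 0.1 * np.eye(3), 8
w_sw = sliding_window_rls(U, d, L, Pi)

# Batch reference computed from the last L samples only.
H, y = U[-L:], d[-L:]
w_batch = np.linalg.solve(Pi + H.T @ H, H.T @ y)
```

The final recursive weight should match the regularized least-squares solution computed from the last L samples alone, which is what makes the solution finite-memory.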


Problem VII.35 (Subspace constrained RLS) Consider an M x m full-rank matrix A ($M \geq m$)
and let w be any vector in its range space, i.e., $w \in \mathcal{R}(A)$. Let $w_N$ denote the solution to the
following regularized least-squares problem

$$\min_{w \in \mathcal{R}(A)} \left[ \lambda^{N+1} w^*\Pi w + \sum_{j=0}^{N} \lambda^{N-j} |d(j) - u_j w|^2 \right]$$

where $\Pi > 0$ and $u_j$ is 1 x M. Find a recursion relating $w_N$ to $w_{N-1}$.


Problem VII.36 (Block RLS) In our treatment of recursive least-squares problems in Secs. 30.2- 
30.6 we assumed that the new data that are incorporated into $\{y_N, H_N\}$ in (30.5) consist of a scalar
d(N) and a row vector $u_N$. However, there are applications where block updates are necessary, in
which case d(N) is replaced by a vector, say $d_N$, and $u_N$ is replaced by a matrix, say $U_N$. In these
situations, the same arguments that were employed in Sec. 30.2 can be repeated to develop a block
RLS algorithm. The purpose of this problem is to derive this algorithm.

Consider a regularized least-squares problem of the form

$$\min_w \left[ (w - \bar{w})^*\Pi(w - \bar{w}) + (y_{N-1} - H_{N-1}w)^* W_{N-1} (y_{N-1} - H_{N-1}w) \right], \quad \Pi > 0$$

and partition the entries of $\{y_{N-1}, H_{N-1}\}$ into

$$y_{N-1} = \begin{bmatrix} d_0 \\ d_1 \\ \vdots \\ d_{N-1} \end{bmatrix}, \qquad H_{N-1} = \begin{bmatrix} U_0 \\ U_1 \\ \vdots \\ U_{N-1} \end{bmatrix}$$

where each $d_j$ has dimensions p x 1 and each $U_j$ has dimensions p x M. We further assume that the
positive-definite weighting matrix $W_{N-1}$ has a block diagonal structure, with p x p positive-definite
diagonal blocks, say $W_{N-1} = \mathrm{diag}\{R_0^{-1}, R_1^{-1}, \ldots, R_{N-1}^{-1}\}$. Let $w_{N-1}$ denote the solution of the
above least-squares problem and let $P_{N-1} = (\Pi + H_{N-1}^* W_{N-1} H_{N-1})^{-1}$.

(a) Show that $P_N = P_{N-1} - P_{N-1}U_N^*\Gamma_N U_N P_{N-1}$, with initial condition $P_{-1} = \Pi^{-1}$ and
where $\Gamma_N = (R_N + U_N P_{N-1} U_N^*)^{-1}$.

(b) Show that $w_N = w_{N-1} + P_{N-1}U_N^*\Gamma_N[d_N - U_N w_{N-1}]$.

(c) Conclude that $w_N$ can be computed recursively by means of the following block RLS
algorithm. Start with $w_{-1} = \bar{w}$ and $P_{-1} = \Pi^{-1}$ and repeat for $i \geq 0$:

$$\begin{aligned}
\Gamma_i &= (R_i + U_i P_{i-1} U_i^*)^{-1} \\
G_i &= P_{i-1} U_i^* \Gamma_i \\
w_i &= w_{i-1} + G_i[d_i - U_i w_{i-1}] \\
P_i &= P_{i-1} - G_i \Gamma_i^{-1} G_i^*
\end{aligned}$$

(d) Establish also the equalities $G_N = P_N U_N^* R_N^{-1}$ and $\Gamma_N = R_N^{-1} - R_N^{-1}U_N P_N U_N^* R_N^{-1}$.

(e) Let $\{r_N, e_N\}$ denote the a posteriori and a priori error vectors, $r_N = d_N - U_N w_N$ and
$e_N = d_N - U_N w_{N-1}$. Show that $R_N^{-1} r_N = \Gamma_N e_N$.

(f) Let $\xi(N-1)$ denote the minimum cost associated with the solution $w_{N-1}$. Show that it
satisfies the time-update relation

$$\xi(N) = \xi(N-1) + e_N^* \Gamma_N e_N$$

Conclude that $\xi(N) = \sum_{i=0}^{N} e_i^* \Gamma_i e_i$.
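
As an illustration of parts (c) and (f), the block recursion can be compared with a direct solution of the stacked problem. The sketch below is written under assumptions of convenience only: real-valued data, a hypothetical function name `block_rls`, and randomly generated blocks.

```python
import numpy as np

def block_rls(U_blocks, d_blocks, R_blocks, Pi, w_bar):
    """Block RLS sketch for real data; returns the final weight and minimum cost."""
    w = np.array(w_bar, dtype=float)
    P = np.linalg.inv(Pi)
    xi = 0.0
    for U, d, R in zip(U_blocks, d_blocks, R_blocks):
        S = R + U @ P @ U.T            # Gamma_i^{-1}
        Gamma = np.linalg.inv(S)
        G = P @ U.T @ Gamma            # gain matrix G_i
        e = d - U @ w                  # a priori error vector e_i
        w = w + G @ e
        P = P - G @ S @ G.T
        xi += e @ Gamma @ e            # xi(i) = xi(i-1) + e_i^* Gamma_i e_i
    return w, xi

rng = np.random.default_rng(2)
p, M, n_blocks = 2, 3, 5
U_blocks = [rng.standard_normal((p, M)) for _ in range(n_blocks)]
d_blocks = [rng.standard_normal(p) for _ in range(n_blocks)]
R_blocks = [np.diag(rng.uniform(0.5, 2.0, p)) for _ in range(n_blocks)]
Pi, w_bar = 0.3 * np.eye(M), rng.standard_normal(M)
w_rec, xi_rec = block_rls(U_blocks, d_blocks, R_blocks, Pi, w_bar)

# Batch reference: stack the blocks and solve the normal equations directly.
H, y = np.vstack(U_blocks), np.concatenate(d_blocks)
W = np.zeros((p * n_blocks, p * n_blocks))
for j, R in enumerate(R_blocks):
    W[j * p:(j + 1) * p, j * p:(j + 1) * p] = np.linalg.inv(R)
w_bat = w_bar + np.linalg.solve(Pi + H.T @ W @ H, H.T @ W @ (y - H @ w_bar))
res = y - H @ w_bat
xi_bat = (w_bat - w_bar) @ Pi @ (w_bat - w_bar) + res @ W @ res
```

Both the weight vector and the accumulated cost should agree with the batch quantities up to round-off.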


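Problem VII.37 (Alternative form of block RLS) Consider the same regularized cost function
of Prob. VII.36 and define $\Phi_N = \Pi + H_N^* W_N H_N$ and $s_N = H_N^* W_N (y_N - H_N\bar{w})$. Show that
$\{\Phi_i, s_i, w_i\}$ satisfy the following recursions. Start with $\Phi_{-1} = \Pi$, $s_{-1} = 0$, and repeat for $i \geq 0$:

$$\begin{aligned}
\Phi_i &= \Phi_{i-1} + U_i^* R_i^{-1} U_i \\
s_i &= s_{i-1} + U_i^* R_i^{-1} [d_i - U_i\bar{w}] \\
\text{solve} \quad & \Phi_i (w_i - \bar{w}) = s_i
\end{aligned}$$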


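Problem VII.38 (Exponentially weighted block RLS) Consider a regularized least-squares
problem of the form

$$\min_w \left[ \lambda^{N+1}(w - \bar{w})^*\Pi(w - \bar{w}) + \sum_{j=0}^{N} \lambda^{N-j}(d_j - U_j w)^* R_j^{-1} (d_j - U_j w) \right]$$

where each $d_j$ is p x 1, each $U_j$ is p x M, and each $R_j$ is p x p and positive-definite. Moreover,
$0 \ll \lambda \leq 1$ is an exponential forgetting factor and $\Pi > 0$. Let $\xi(N)$ denote the value of the minimum
cost associated with the optimal solution $w_N$.

(a) Repeat the arguments of Prob. VII.36 to show that the solution $w_N$ can be time-updated as
follows:

$$\begin{aligned}
\Gamma_i &= (R_i + \lambda^{-1} U_i P_{i-1} U_i^*)^{-1} \\
G_i &= \lambda^{-1} P_{i-1} U_i^* \Gamma_i \\
e_i &= d_i - U_i w_{i-1} \\
w_i &= w_{i-1} + G_i[d_i - U_i w_{i-1}], \quad w_{-1} = \bar{w} \\
P_i &= \lambda^{-1} P_{i-1} - G_i \Gamma_i^{-1} G_i^*, \quad P_{-1} = \Pi^{-1} \\
r_i &= d_i - U_i w_i \\
\xi(i) &= \lambda\xi(i-1) + e_i^* \Gamma_i e_i, \quad \xi(-1) = 0 \\
&= \lambda\xi(i-1) + r_i^* R_i^{-1} e_i
\end{aligned}$$

Verify also that the quantities $\{G_i, \Gamma_i\}$ admit the alternative expressions $G_i = P_i U_i^* R_i^{-1}$ and
$\Gamma_i = R_i^{-1} - R_i^{-1} U_i P_i U_i^* R_i^{-1}$.

(b) Repeat the arguments of Prob. VII.37 to show that the solution $w_N$ can also be evaluated as
follows:

$$\begin{aligned}
\Phi_i &= \lambda\Phi_{i-1} + U_i^* R_i^{-1} U_i, \quad \Phi_{-1} = \Pi \\
s_i &= \lambda s_{i-1} + U_i^* R_i^{-1} [d_i - U_i\bar{w}], \quad s_{-1} = 0 \\
\text{solve} \quad & \Phi_i (w_i - \bar{w}) = s_i
\end{aligned}$$

In other words, $w_i = \bar{w} + \Phi_i^{-1} s_i$.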


Problem VII.39 (Block RLS with singular weighting matrices) Probs. VII.36-VII.38 assume
positive-definite weighting factors $\{R_j\}$. This restriction can be relaxed. Thus consider a regularized
least-squares problem of the form

$$\min_w \left[ \lambda^{N+1}(w - \bar{w})^*\Pi(w - \bar{w}) + \sum_{j=0}^{N} \lambda^{N-j}(d_j - U_j w)^* A_j (d_j - U_j w) \right]$$

where each $d_j$ is p x 1, each $U_j$ is p x M, and each weighting factor $A_j$ is p x p and positive
semi-definite (hence, possibly singular). Moreover, $0 \ll \lambda \leq 1$ is an exponential forgetting factor
and $\Pi > 0$. Repeat the arguments of Prob. VII.37 to show that the solution $w_N$ can be evaluated
iteratively as follows:

$$\begin{aligned}
\Phi_i &= \lambda\Phi_{i-1} + U_i^* A_i U_i, \quad \Phi_{-1} = \Pi \\
s_i &= \lambda s_{i-1} + U_i^* A_i [d_i - U_i\bar{w}], \quad s_{-1} = 0 \\
\text{solve} \quad & \Phi_i (w_i - \bar{w}) = s_i
\end{aligned}$$

Argue that $\Phi_i$ is invertible.


Problem VII.40 (Data fusion) At each time $n \geq 0$, a total of M noisy measurements of a
scalar unknown variable x are collected across M spatially-distributed sensors, say
$y_k(n) = x + v_k(n)$, k = 0, 1, ..., M - 1. The noises $\{v_k(n)\}$ are assumed to be spatially and temporally white
with variances $\{\sigma^2_{v,k}(n)\}$.

(a) Show that the minimum-variance unbiased estimator (m.v.u.e.) of x given the observation
vector $y_n = \mathrm{col}\{y_0(n), y_1(n), \ldots, y_{M-1}(n)\}$ is given by

$$\hat{x}_n = \frac{\sum_{k=0}^{M-1} y_k(n)/\sigma^2_{v,k}(n)}{\sum_{k=0}^{M-1} 1/\sigma^2_{v,k}(n)}$$

(b) More generally, assume x is estimated instead by solving a deterministic least-squares
problem of the form:

$$\min_x \sum_{n=0}^{N} \lambda^{N-n} \sum_{k=0}^{M-1} \alpha_k(n)\, |y_k(n) - x|^2 \;\Longrightarrow\; \hat{x}_N$$

where $0 \ll \lambda \leq 1$ is an exponential forgetting factor and $\alpha_k(n)$ are some nonnegative
weighting coefficients (for example, $\alpha_k(n) = 1/\sigma^2_{v,k}(n)$). Show that $\hat{x}_N$ can be computed
recursively as follows:

$$\begin{aligned}
\phi(n) &= \lambda\phi(n-1) + \sum_{k=0}^{M-1} \alpha_k(n), \quad \phi(-1) = 0 \\
s(n) &= \lambda s(n-1) + \sum_{k=0}^{M-1} \alpha_k(n) y_k(n), \quad s(-1) = 0 \\
\hat{x}_n &= s(n)/\phi(n)
\end{aligned}$$
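
The recursion of part (b) amounts to two scalar accumulators. A minimal sketch follows, assuming real-valued measurements, a hypothetical function name `fuse`, and at least one positive weight at the first time instant so that $\phi(n) > 0$.

```python
def fuse(measurements, weights, lam=1.0):
    """measurements[n][k] = y_k(n) and weights[n][k] = alpha_k(n)."""
    phi, s = 0.0, 0.0
    estimates = []
    for y_n, a_n in zip(measurements, weights):
        phi = lam * phi + sum(a_n)                              # phi(n)
        s = lam * s + sum(a * y for a, y in zip(a_n, y_n))      # s(n)
        estimates.append(s / phi)                               # x_hat_n
    return estimates

# Two sensors, two time instants, forgetting factor 0.5.
est = fuse([[1.0, 3.0], [4.0, 6.0]], [[3.0, 1.0], [1.0, 1.0]], lam=0.5)
print(est)  # [1.5, 3.25]
```

For the second instant, the direct formula gives (0.5·6 + 10)/(0.5·4 + 2) = 3.25, in agreement with the recursion.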
Problem VII.41 (Data reweighting) Consider a regularized least-squares problem of the form

$$\min_w \left[ w^*\Pi w + (y_{N-1} - H_{N-1}w)^* W_{N-1} (y_{N-1} - H_{N-1}w) \right], \quad \Pi > 0$$

and partition the entries of $\{y_{N-1}, H_{N-1}\}$ into

$$y_{N-1} = \begin{bmatrix} d_0 \\ d_1 \\ \vdots \\ d_{N-1} \end{bmatrix}, \qquad H_{N-1} = \begin{bmatrix} U_0 \\ U_1 \\ \vdots \\ U_{N-1} \end{bmatrix}$$

where each $d_j$ has dimensions p x 1 and each $U_j$ has dimensions p x M. We further assume that
the positive-definite weighting matrix $W_{N-1}$ has a block diagonal structure, with p x p
positive-definite diagonal blocks, $W_{N-1} = \mathrm{diag}\{R_0^{-1}, R_1^{-1}, \ldots, R_{N-1}^{-1}\}$. Let $w_{N-1}$ denote the solution
of the above least-squares problem. Now assume the weighting matrix $W_N$ is related to $W_{N-1}$ as
follows

$$W_N = \begin{bmatrix} D_{N-1}W_{N-1} & \\ & R_N^{-1} \end{bmatrix}$$

where the diagonal matrix $D_{N-1}$ has the form $D_{N-1} = \mathrm{diag}\{I_p, \ldots, I_p, \beta I_p, I_p, \ldots, I_p\}$, and $\beta > 1$
is a positive scalar. The scalar $\beta$ appears at the location corresponding to the k-th block $R_k^{-1}$. Find
a recursion relating $w_N$ to $w_{N-1}$.


COMPUTER PROJECTS 


Project VII.1 (OFDM receiver) This project illustrates how least-squares and least-mean-squares 
solutions are useful in the design of orthogonal frequency-division multiplexing (OFDM) receivers. 


FIGURE VII.2 OFDM transmitter and receiver structures.


In an OFDM system, data are transmitted in blocks, say of size N symbols each, with additional
P symbols added for cyclic prefixing purposes (in a manner similar to what we discussed before in
Probs. II.27 and II.28). Specifically, consider a block of data of size N,

$$\mathbf{s} = \mathrm{col}\{ \mathbf{s}(N-1), \mathbf{s}(N-2), \ldots, \mathbf{s}(0) \}$$

Before transmission, this block of data is transformed by the inverse DFT matrix, i.e., $\mathbf{s}$ is
transformed into $\bar{\mathbf{s}} = F^*\mathbf{s}$, where F is the unitary DFT matrix of size N defined by

$$[F]_{ik} \triangleq \frac{1}{\sqrt{N}}\, e^{-j2\pi ik/N}, \quad i, k = 0, 1, \ldots, N-1, \quad j = \sqrt{-1}$$

Then a cyclic prefix of length P is added to the transformed data, so that the transmitted sequence
ends up being (see Fig. VII.2):

$$\mathrm{col}\{ \underbrace{\bar{\mathbf{s}}(N-1), \bar{\mathbf{s}}(N-2), \ldots, \bar{\mathbf{s}}(0)}_{\text{transformed block of size } N},\; \underbrace{\bar{\mathbf{s}}(N-1), \bar{\mathbf{s}}(N-2), \ldots, \bar{\mathbf{s}}(N-P)}_{\text{cyclic prefix of size } P} \}$$

with the rightmost sample transmitted at time instant 1 and the leftmost sample transmitted at time
instant N + P. At the receiver, these (N + P) samples are observed in the presence of additive noise
and collected into a vector, say

$$\mathrm{col}\{ \underbrace{\bar{\mathbf{y}}(N+P-1), \bar{\mathbf{y}}(N+P-2), \ldots, \bar{\mathbf{y}}(P)}_{\text{last } N \text{ received samples}},\; \underbrace{\bar{\mathbf{y}}(P-1), \ldots, \bar{\mathbf{y}}(2), \bar{\mathbf{y}}(1), \bar{\mathbf{y}}(0)}_{\text{first } P \text{ received samples}} \}$$

The first P received samples are discarded, while the remaining N received samples are collected
into an N x 1 vector, $\bar{\mathbf{y}}$.


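Data model. Assuming an M-tap FIR model for the channel with impulse response sequence,

$$h = \mathrm{col}\{h(0), h(1), \ldots, h(M-1)\}$$

we can verify that (recall Prob. II.27):

$$\underbrace{\begin{bmatrix} \bar{\mathbf{y}}(N+P-1) \\ \bar{\mathbf{y}}(N+P-2) \\ \bar{\mathbf{y}}(N+P-3) \\ \vdots \\ \bar{\mathbf{y}}(P) \end{bmatrix}}_{\bar{\mathbf{y}}} = \underbrace{\begin{bmatrix} h(0) & h(1) & \cdots & h(M-1) & & & \\ & h(0) & h(1) & \cdots & h(M-1) & & \\ & & \ddots & \ddots & & \ddots & \\ & & & h(0) & h(1) & \cdots & h(M-1) \\ \vdots & & & & \ddots & & \vdots \\ h(2) & \cdots & h(M-1) & & & h(0) & h(1) \\ h(1) & h(2) & \cdots & h(M-1) & & & h(0) \end{bmatrix}}_{H} \underbrace{\begin{bmatrix} \bar{\mathbf{s}}(N-1) \\ \bar{\mathbf{s}}(N-2) \\ \bar{\mathbf{s}}(N-3) \\ \vdots \\ \bar{\mathbf{s}}(0) \end{bmatrix}}_{\bar{\mathbf{s}}} + \underbrace{\begin{bmatrix} \bar{\mathbf{v}}(N+P-1) \\ \bar{\mathbf{v}}(N+P-2) \\ \bar{\mathbf{v}}(N+P-3) \\ \vdots \\ \bar{\mathbf{v}}(P) \end{bmatrix}}_{\bar{\mathbf{v}}}$$

or, more compactly,

$$\bar{\mathbf{y}} = H\bar{\mathbf{s}} + \bar{\mathbf{v}} \qquad \text{(transform domain)}$$

where H is the N x N channel matrix and $\bar{\mathbf{v}}$ denotes measurement noise in the transformed domain.

Now observe that H has a circulant structure and, therefore, it can be diagonalized by the DFT
matrix. Let $\Lambda = FHF^*$ be the diagonalized channel, and multiply the above equation by F from
the left. Then we obtain

$$\mathbf{y} = \Lambda\mathbf{s} + \mathbf{v} \qquad \text{(time domain)} \qquad \text{(VII.1)}$$

where the time-domain vector quantities $\{\mathbf{y}, \mathbf{s}, \mathbf{v}\}$ are defined by

$$\mathbf{y} \triangleq F\bar{\mathbf{y}}, \qquad \mathbf{s} \triangleq F\bar{\mathbf{s}}, \qquad \mathbf{v} \triangleq F\bar{\mathbf{v}}$$

Let $\lambda$ be a column vector with the entries of $\Lambda$, i.e., $\lambda = \mathrm{diag}\{\Lambda\}$. Then it holds that (see
Prob. VII.27):

$$\lambda = \sqrt{N}\, F^* \begin{bmatrix} h_{M \times 1} \\ 0_{(N-M) \times 1} \end{bmatrix} \qquad \text{(VII.2)}$$

That is, $\lambda$ is the inverse DFT of the channel impulse response with its length extended to N.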
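
The diagonalization claim and relation (VII.2) are easy to confirm numerically. The sketch below is illustrative only; the sizes N = 8 and M = 3 and the random channel taps are arbitrary assumptions, and the circulant matrix is built with its first row equal to the zero-padded impulse response, as in the data model above.

```python
import numpy as np

N, M = 8, 3
rng = np.random.default_rng(0)
h = rng.standard_normal(M) + 1j * rng.standard_normal(M)
h_ext = np.concatenate([h, np.zeros(N - M)])      # impulse response padded to N

# Unitary DFT matrix: [F]_{ik} = exp(-2j*pi*i*k/N) / sqrt(N)
F = np.fft.fft(np.eye(N)) / np.sqrt(N)

# Circulant channel matrix: row r is the first row shifted right by r,
# so that H[r, c] = h_ext[(c - r) mod N].
idx = (np.arange(N)[None, :] - np.arange(N)[:, None]) % N
H = h_ext[idx]

Lam = F @ H @ F.conj().T                          # should be diagonal
lam = np.sqrt(N) * F.conj().T @ h_ext             # right-hand side of (VII.2)
off_diag = Lam - np.diag(np.diag(Lam))
```

Here `off_diag` should vanish to within round-off, and the diagonal of `Lam` should coincide with `lam`, which is also N times the output of `np.fft.ifft(h_ext)`.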


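Data recovery. Given the linear relation (VII.1) between $(\mathbf{y}, \mathbf{s})$, and assuming the channel is known,
we can invoke the results of Sec. 5.1 to conclude that the linear least-mean-squares estimator of $\mathbf{s}$
given $\mathbf{y}$ is

$$\hat{\mathbf{s}} = R_s\Lambda^* \left[ R_v + \Lambda R_s \Lambda^* \right]^{-1} \mathbf{y}$$

where $R_s = \mathsf{E}\,\mathbf{s}\mathbf{s}^*$ and $R_v = \mathsf{E}\,\mathbf{v}\mathbf{v}^*$. Assuming that the data $\{\mathbf{s}(\cdot)\}$ and the noise $\{\mathbf{v}(\cdot)\}$ are i.i.d.
with variances $\sigma_s^2$ and $\sigma_v^2$, respectively, then $R_s = \sigma_s^2 I$ and $R_v = \sigma_v^2 I$. In this way, the above
expression for $\hat{\mathbf{s}}$ simplifies to

$$\hat{\mathbf{s}} = \Lambda^* \left[ \frac{\sigma_v^2}{\sigma_s^2} I + \Lambda\Lambda^* \right]^{-1} \mathbf{y} \qquad \text{(data recovery)} \qquad \text{(VII.3)}$$

Observe that the matrix being inverted is diagonal, which therefore results in a simple receiver
structure as shown in Fig. VII.2. If we denote the entries of $\mathbf{y}$ by

$$\mathbf{y} = \mathrm{col}\{ \mathbf{y}(N+P-1), \mathbf{y}(N+P-2), \ldots, \mathbf{y}(P) \}$$

then the estimators for the individual symbols are given by

$$\hat{\mathbf{s}}(i) = \frac{\lambda^*(i)}{\sigma_v^2/\sigma_s^2 + |\lambda(i)|^2}\, \mathbf{y}(i + P), \quad i = 0, 1, \ldots, N-1$$

We may remark that expression (VII.3) for $\hat{\mathbf{s}}$ in terms of $\mathbf{y}$ could have also been obtained had we
treated $(\mathbf{s}, \mathbf{y})$ as deterministic quantities $(s, y)$ (i.e., as realizations) and formulated the problem of
estimating s from y in the regularized least-squares sense

$$\min_s \left[ \frac{1}{\mathrm{SNR}} \|s\|^2 + \|y - \Lambda s\|^2 \right]$$

with regularization parameter 1/SNR, where $\mathrm{SNR} = \sigma_s^2/\sigma_v^2$.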
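
The equivalence between the estimator, its per-tone form, and the regularized least-squares solution can be illustrated as follows. This is only a sketch with made-up values: the channel gains `lam`, the observation `y`, and the variances are arbitrary assumptions chosen for the check.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 6
sigma_s2, sigma_v2 = 1.0, 0.1
lam = rng.standard_normal(N) + 1j * rng.standard_normal(N)   # hypothetical gains
Lam = np.diag(lam)
y = rng.standard_normal(N) + 1j * rng.standard_normal(N)     # hypothetical data

# Linear least-mean-squares form with R_s = sigma_s2*I and R_v = sigma_v2*I.
Rs, Rv = sigma_s2 * np.eye(N), sigma_v2 * np.eye(N)
s_llms = Rs @ Lam.conj().T @ np.linalg.solve(Rv + Lam @ Rs @ Lam.conj().T, y)

# Per-tone form: one scalar division per symbol.
s_tone = lam.conj() * y / (sigma_v2 / sigma_s2 + np.abs(lam) ** 2)

# Regularized least-squares form with regularization parameter 1/SNR.
snr = sigma_s2 / sigma_v2
s_rls = np.linalg.solve(Lam.conj().T @ Lam + np.eye(N) / snr, Lam.conj().T @ y)
```

All three vectors should coincide. The per-tone form is the one a receiver would use in practice, since it avoids any matrix inversion.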


Training. In order to recover the transmitted signals (i.e., in order to estimate $\mathbf{s}$) as above, the channel
taps are needed (i.e., $\lambda$ is needed). Different training schemes can be used to enable the receiver to
estimate the channel and, consequently, $\lambda$. The most common training scheme is to allocate some
of the tones, i.e., some of the $\{\mathbf{s}(i)\}$ in an OFDM symbol, to known training data. We shall refer to
these known symbols by writing $\{s(i)\}$ (with normal font instead of boldface letters) since they are
not random any longer. The channel taps are then estimated as follows. We first rewrite (VII.1) as

$$\mathbf{y} = \underbrace{\mathrm{diag}\{ s(N-1), s(N-2), \ldots, s(0) \}}_{S}\, \lambda + \mathbf{v}$$

Let $\{k_1, k_2, \ldots, k_L\}$ denote the indices of the L ($L \geq M$) elements of s that are used as training
tones, and are therefore known. We collect these transmitted training tones, which are known and,
hence, deterministic quantities, and the corresponding received data into two vectors

$$\begin{aligned}
s_t &\triangleq \mathrm{col}\{ s(k_L), s(k_{L-1}), \ldots, s(k_1) \} \\
\mathbf{y}_t &\triangleq \mathrm{col}\{ \mathbf{y}(k_L + P), \mathbf{y}(k_{L-1} + P), \ldots, \mathbf{y}(k_1 + P) \}
\end{aligned}$$

Let Q denote the corresponding M x L submatrix of F,

$$Q \triangleq \begin{bmatrix} [F]_{0,k_L} & [F]_{0,k_{L-1}} & \cdots & [F]_{0,k_1} \\ [F]_{1,k_L} & [F]_{1,k_{L-1}} & \cdots & [F]_{1,k_1} \\ \vdots & \vdots & & \vdots \\ [F]_{M-1,k_L} & [F]_{M-1,k_{L-1}} & \cdots & [F]_{M-1,k_1} \end{bmatrix} \qquad (M \times L)$$

and let $S_t$ be the corresponding L x L submatrix of S, i.e., $S_t = \mathrm{diag}\{s_t\}$. Then

$$\underbrace{\mathbf{y}_t}_{L \times 1} = \sqrt{N}\, \underbrace{S_t Q^*}_{L \times M}\, \underbrace{h}_{M \times 1} + \mathbf{v}_t$$

where $\mathbf{v}_t$ is the corresponding noise vector, $\mathbf{v}_t = \mathrm{col}\{\mathbf{v}(k_L + P), \ldots, \mathbf{v}(k_1 + P)\}$. We can now
recover h by solving a least-squares problem, namely, as

$$\hat{h} = \frac{1}{\sqrt{N}} \left[ Q S_t^* S_t Q^* \right]^{-1} Q S_t^* y_t$$

Using the fact that $S_t$ is diagonal, and assuming the training data satisfy $|s(i)|^2 = 1$ so that $S_t^* S_t = I$,
then the above least-squares expression simplifies to

$$\hat{h} = \frac{1}{\sqrt{N}} (QQ^*)^{-1} Q \begin{bmatrix} y_t(k_L + P)/s_t(k_L) \\ \vdots \\ y_t(k_1 + P)/s_t(k_1) \end{bmatrix} \qquad \text{(channel estimation)}$$


Channel tracking. The above explanation assumes that one OFDM symbol is used to estimate
the channel taps. However, in practice, this procedure is not accurate enough due to noise and
variations in the channel. Therefore, multiple OFDM symbols are usually used to estimate and track
the channel. Thus assume that k OFDM symbols are transmitted with the training tones in the same
locations. Then repeating the equality $\mathbf{y}_t = \sqrt{N} S_t Q^* h + \mathbf{v}_t$ for the k received symbols we have

$$\begin{bmatrix} \mathbf{y}_{t,k} \\ \mathbf{y}_{t,k-1} \\ \vdots \\ \mathbf{y}_{t,1} \end{bmatrix} = \sqrt{N} \begin{bmatrix} S_{t,k} \\ S_{t,k-1} \\ \vdots \\ S_{t,1} \end{bmatrix} Q^* h + \begin{bmatrix} \mathbf{v}_{t,k} \\ \mathbf{v}_{t,k-1} \\ \vdots \\ \mathbf{v}_{t,1} \end{bmatrix}$$

where the indices {1, ..., k} correspond to the received OFDM symbols. Now, if we assume that
the training data used in different symbols are the same (which is a valid assumption in practice),
then the least-squares estimate of h is simply the average of the channel estimates derived for each
OFDM symbol separately, i.e.,

$$\hat{h} = \frac{1}{k} \sum_{i=1}^{k} \hat{h}_i$$

where $\hat{h}_i$ is the channel estimate derived from the i-th received symbol, as in the previous subsection.


Training schemes. Different training schemes can be used in OFDM systems to estimate the
channel, as indicated in Fig. VII.3. One scheme is to assign all tones in an OFDM symbol for training,
and then use all tones in the subsequent symbols for data transmission. In other words, some symbols 
are entirely for training purposes and others are purely for data transmission. This scheme is shown 
in Fig. VII.3(a). An alternative scheme is to allocate certain tone indices for training on all symbols 
and to use the other tones for data transmission. In this scheme, the training sequence and the data 
are transmitted simultaneously on all symbols, as shown in Fig. VII.3(b). The advantage of the first 
scheme is that the matrix inversion $(QQ^*)^{-1}$ becomes moot since $QQ^* = I$ in this case. Yet, this
scheme cannot track channel variations occurring between the symbols assigned for training.

FIGURE VII.3 Two training schemes for an OFDM system: (a) training followed by data transmission; and
(b) simultaneous training and data transmission.


Remark. Comparing the structure of an OFDM system with that of a single-carrier frequency-domain-equalization 
system, as described earlier in Prob. II.28, we find that in the latter scheme a cyclic prefix is added to the raw (time-
domain) data before transmission. In OFDM, on the other hand, the data is first transformed to the frequency 
domain and then a cyclic prefix is added before transmission. As a result, OFDM systems tend to suffer from a 
peak-to-average ratio (PAR) problem, i.e., large signal peaks can occur due to constructive interferences among 
the signals in the sub-channels. For a comparison of both modulation schemes (single-carrier-frequency-domain 
equalization and OFDM) see, e.g., Sari, Karam, and Jeanclaude (1995) and Czylwik (1997). 


Simulations. The following are the parameters used in our simulation of an OFDM system: N = 64
(block or symbol size), P = 16 (cyclic prefix length), M = 8 (channel length) and L = 8 (training
tones in a symbol). The transmitted data are drawn from a QPSK constellation with unit power. The
channel taps are complex numbers with Gaussian distribution for the real and imaginary parts.


(a) Write a program that simulates the performance of the two training schemes of Fig. VII.3. 
Specifically, for the first scheme, use all 64 tones of the first symbol for training and channel 
estimation. For the second scheme, use only 8 tones corresponding to indices 


(4, 12, 20, 28, 36, 44, 52, 60) 


for training. In order to perform a fair comparison, in the second scheme, estimate the channel 
only after receiving 8 consecutive symbols. This is to ensure that both schemes have the same 
total training overhead. Set the SNR at 10 dB at the input to the receiver and simulate the 
operation of the OFDM receiver; estimate the channel taps both in the time domain (h) and 
in the frequency domain ($\lambda$).

(b) For each of the training schemes of Fig. VII.3, generate BER curves versus SNR by averaging
over 100 randomly generated channels, i.e., perform training and data recovery over 100 
channels and average their bit-error-rate performance. Vary the SNR between 0 and 20 dB. 
Generate also the BER curve assuming exact knowledge of the channel. Fix the SNR at 
25 dB and plot the scatter diagrams of the transmitted sequence, the received sequence, and 
the recovered data using scheme (a) from Fig. VII.3. 


Project VII.2 (Tracking Rayleigh fading channels) We reconsider the problem of tracking a
Rayleigh fading channel from Computer Project IV.2. Let $w_i^o$ denote the tap vector of an M-th
order multipath channel and assume all taps fade at the same Doppler rate $f_D$. We mentioned in
Computer Project IV.2 that one simple approximation for the time-variation in the channel would be
to model it as $w_i^o = \alpha w_{i-1}^o + q_i$, where $\alpha = J_0(2\pi f_D T_s)$ and $q_i$ is an i.i.d. random vector with
covariance matrix $Q = qI = (1 - \alpha^2)I$. The quantities $\{\alpha, q\}$ are defined in terms of the maximum
Doppler frequency $f_D$, the sampling period $T_s$, and the Bessel function

$$J_0(x) \triangleq \frac{1}{\pi} \int_0^{\pi} \cos(x \sin\theta)\, d\theta$$

which characterizes the auto-correlation sequence of each of the taps as

$$r(k) \triangleq \mathsf{E}\, x(n)x^*(n-k) = J_0(2\pi f_D T_s k), \quad k = \ldots, -1, 0, 1, \ldots$$

(a) Let M = 5, $f_D$ = 10 Hz, and $T_s$ = 0.8 μs. Design an adaptive tracker for the channel using
1) ε-NLMS with μ = 0.25; 2) RLS with λ = 0.995; 3) the extended RLS algorithm (31.42)
with the above values for α and q and λ = 0.995. Generate learning curves for the three
algorithms by averaging over 200 experiments. Set the SNR level at 30 dB. Compare the
steady-state mean-square error values with those predicted by theory.

(b) Increase the Doppler frequency to $f_D$ = 80 Hz and repeat the simulations. What do you
observe?

(c) In parts (a) and (b), the value of the parameter α is very close to unity and the value of q is
also very small. In order to illustrate more clearly the difference in performance that arises
between RLS and its extended version (31.42), consider a model with more distinct values for
{α, q}, say α = 0.95 and q = 0.1, and repeat the simulations of part (a). Repeat also for
α = 1 and q = 0.1. Start from an initial condition $w_{-1}^o$ that is normally distributed with
identity covariance matrix.
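
To see how close α is to unity in part (a), the Bessel integral can be evaluated directly. The following is a minimal sketch that approximates the integral definition of $J_0$ by the midpoint rule; the function name `bessel_j0` and the number of integration steps are illustrative assumptions, not part of the project statement.

```python
import math

def bessel_j0(x, steps=2000):
    """Midpoint-rule approximation of (1/pi) * integral_0^pi cos(x sin t) dt."""
    h = math.pi / steps
    total = sum(math.cos(x * math.sin((k + 0.5) * h)) for k in range(steps))
    return total * h / math.pi

f_D, T_s = 10.0, 0.8e-6                  # values from part (a)
alpha = bessel_j0(2 * math.pi * f_D * T_s)
q = 1.0 - alpha ** 2                     # variance of the driving noise
```

With these values q is on the order of 1e-9, which is consistent with the observation in part (c) that α is very close to unity and q is very small.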
Chapter 33: Norm and Angle Preservation
Chapter 34: Unitary Transformations 
Chapter 35: QR and Inverse QR Algorithms 
Summary and Notes 

Problems and Computer Projects 




Norm and Angle Preservation 


Array methods are powerful algorithmic variants that are theoretically equivalent to the
recursive least-squares algorithm, but they nevertheless perform the required computations 
in a more reliable manner. In these array forms, an algorithm is described not as an ex- 
plicit set of equations, but as a sequence of elementary operations on arrays of numbers 
(or matrices). Usually, a pre-array of numbers has to be triangularized by a sequence of 
elementary rotations in order to yield a post-array of numbers. The quantities needed to 
form the next pre-array are read off from the entries of the post-array, and the procedure 
can be repeated. The explicit forms of the rotation matrices are not needed in most cases, 
and they can be implemented in a variety of well-known ways, e.g., as a sequence of ele- 
mentary circular or hyperbolic rotations. The purpose of this chapter is to develop several 
array-based methods for RLS filtering. In order to motivate such array algorithms, we shall 
first consider a simple (yet contrived) example that helps highlight some important issues. 


33.1 SOME DIFFICULTIES 


Thus, consider the update equation (30.12) for the variable $P_N$ in the RLS algorithm, and
assume that all variables are real and scalar-valued (and, hence, we shall write $\{u(N), P(N)\}$
instead of $\{u_N, P_N\}$). Assume further that at some iteration $n_o$, especially during the initial
stages of adaptation where P(N) is more likely to assume large values, $u(n_o) = 1$
and $P(n_o - 1)$ is sufficiently large so that, in finite precision, the value of $1 + P(n_o - 1)$
evaluates to $P(n_o - 1)$, i.e.,

$$1 + P(n_o - 1) = P(n_o - 1) \quad \text{(in finite precision)} \qquad (33.1)$$

From (30.12) we have that the value of $P(n_o)$ is obtained via the update

$$P(n_o) = P(n_o - 1) - \frac{P^2(n_o - 1)}{1 + P(n_o - 1)} \qquad (33.2)$$

which is also equivalent to

$$P(n_o) = \frac{P(n_o - 1)}{1 + P(n_o - 1)} \qquad (33.3)$$

Now assume that the term $P^2(n_o - 1)/[1 + P(n_o - 1)]$ in (33.2) is evaluated as follows:

$$P(n_o - 1) \cdot \left\{ \frac{P(n_o - 1)}{1 + P(n_o - 1)} \right\}$$

That is, we first evaluate the ratio $P(n_o - 1)/[1 + P(n_o - 1)]$
and then multiply the result by $P(n_o - 1)$. Then, because of (33.1), the above ratio evaluates
to 1 and, therefore, if we compute $P(n_o)$ using (33.2) we get

$$P(n_o) = P(n_o - 1) - P(n_o - 1) \cdot 1 = 0$$

On the other hand, recursion (33.3) gives $P(n_o) = 1$.

The values obtained for $P(n_o)$ are obviously different and the second one is the desired
value since, from (33.3),

$$\lim_{P(n_o - 1) \to \infty} P(n_o) = 1$$

We therefore find from the equivalent expressions (33.2) and (33.3) that these two different
implementations of the same recursion can behave differently in the presence of round-off
errors. There is a reason for the failure of the implementation based on (33.2); it evaluates
the nonnegative quantity $P(n_o)$ as the difference of two nearly equal and large nonnegative
numbers, and the result is an undesired cancellation; in some other more complex scenarios,
the variable $P_N$ may even lose its positive-definiteness. Such cancellations are often,
but not always, bad phenomena in finite-precision implementations so much so that they
are usually called catastrophic cancellations!
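
The effect is easy to reproduce in double-precision arithmetic. In the sketch below, the value 1e17 is merely an illustrative choice that is large enough for condition (33.1) to hold in IEEE double precision; it is not taken from the text.

```python
P_prev = 1e17                        # large enough that 1 + P_prev == P_prev
assert 1.0 + P_prev == P_prev        # condition (33.1) in double precision

ratio = P_prev / (1.0 + P_prev)      # the ratio evaluates to exactly 1.0
P_via_33_2 = P_prev - P_prev * ratio # form (33.2): two equal large numbers cancel
P_via_33_3 = P_prev / (1.0 + P_prev) # form (33.3)

print(P_via_33_2, P_via_33_3)        # prints 0.0 1.0
```

The two algebraically equivalent forms return 0 and 1, respectively, exactly as in the discussion above.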

This simple example shows that there is always merit in considering alternative
implementations even for the same algorithm. Another example is presented in Prob. VIII.11
where two theoretically equivalent implementations of the same algorithm are also shown 
to react differently in response to perturbation in the data. 

For these and other reasons, we are motivated to develop array methods for recursive 
least-squares problems. The array methods will have several intrinsic advantages over a 
plain RLS implementation; for one thing, they will be more reliable in finite-precision 
implementations (as explained in the sequel and as illustrated later in the computer project 
at the end of the chapter). We start with a handful of definitions. 


33.2 SQUARE-ROOT FACTORS 


A key element in array algorithms is the concept of a square-root of a positive-definite 
matrix. 

Definition 

Although the concept of square-roots can be defined for nonnegative-definite matrices, it
is enough for our purposes here to focus on positive-definite matrices. Thus, let $A$ denote
an $n \times n$ positive-definite matrix and introduce its eigen-decomposition

$$A = U \Lambda U^* \qquad (33.4)$$

where $\Lambda$ is an $n \times n$ diagonal matrix with real positive entries, which correspond to the
eigenvalues of $A$, and $U$ is a unitary matrix, namely an $n \times n$ square matrix that satisfies

$$UU^* = U^*U = I$$

The columns of $U$ correspond to the orthonormal eigenvectors of $A$ as can be seen by
re-writing (33.4) as $AU = U\Lambda$. Let $\Lambda^{1/2}$ denote a diagonal matrix whose entries are the
positive square-roots of the diagonal entries of $\Lambda$ and, hence,

$$\Lambda = \Lambda^{1/2} \left(\Lambda^{1/2}\right)^*$$

Then we can rewrite (33.4) as

$$A = \left(U\Lambda^{1/2}\right)\left(U\Lambda^{1/2}\right)^*$$

which expresses $A$ as the product of an $n \times n$ matrix and its conjugate transpose, namely,

$$A = XX^* \quad \text{with} \quad X \triangleq U\Lambda^{1/2}$$

We say that $X$ is a square-root of $A$.


Definition 33.1 (Square-root factors) A square-root of an $n \times n$ positive-definite
matrix $A$ is any $n \times n$ matrix $X$ satisfying $A = XX^*$.


The construction prior to the definition exhibits one possible choice for $X$, namely,
$X = U\Lambda^{1/2}$, in terms of the eigenvectors and eigenvalues of $A$. However, square-root
factors are highly nonunique. This is true even for scalars. For instance, the number 4
has infinitely many square-roots over the field of complex numbers, namely, $2e^{j\phi}$ for any
$\phi \in [-\pi, \pi]$. For matrices, if we take the above $X$ and multiply it by any unitary matrix
$\Theta$, say, $\bar{X} = X\Theta$ where $\Theta\Theta^* = I$, then $\bar{X}$ is also a square-root factor of $A$ since

$$\bar{X}\bar{X}^* = X\underbrace{\Theta\Theta^*}_{=I}X^* = XX^* = A$$


Notation 


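It is customary to use the notation $A^{1/2}$ to refer to a square-root of a matrix $A$ and,
therefore, we write

$$A = A^{1/2}\left(A^{1/2}\right)^*$$

It is also customary to employ the compact notations

$$A^{*/2} \triangleq \left(A^{1/2}\right)^*, \qquad A^{-1/2} \triangleq \left(A^{1/2}\right)^{-1}, \qquad A^{-*/2} \triangleq \left(A^{1/2}\right)^{-*}$$

so that

$$A = A^{1/2}A^{*/2}, \qquad A^{-1} = A^{-*/2}A^{-1/2} \qquad (33.5)$$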


Cholesky Factor 

One of the most widely used square-root factors of a positive-definite matrix is its Cholesky
factor. Recall that we showed in Sec. B.3 that every positive-definite matrix $A$ admits a
unique triangular factorization of the form

$$A = LL^* \qquad (33.6)$$

where $L$ is a lower-triangular matrix with positive entries on its diagonal. We could also
consider the alternative factorization $A = UU^*$ in terms of an upper triangular matrix $U$.
However, the lower triangular form will be the standard form for our discussions. The
factor $L$ is called the Cholesky factor of $A$. Comparing (33.6) with the defining relation
(33.5), we see that $L$ is also a square-root factor of $A$. For our purposes, whenever we refer
to the square-root of a matrix $A$ we shall mean its Cholesky factor. It has two advantages
in relation to other square-root factors: it is lower triangular and is uniquely defined (i.e.,
there is no other lower triangular square-root factor with positive diagonal entries).
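
As an illustration of both points (the Cholesky factor and the nonuniqueness of square-root factors), the sketch below computes the lower-triangular factor of a small real positive-definite matrix and then rotates it into a second, non-triangular square-root. The matrix, the function names, and the rotation parameters (0.6, 0.8) are assumptions made for the example; the recursion is the standard row-by-row Cholesky computation for real data, not code taken from the text.

```python
import math


def cholesky_lower(a):
    """Lower-triangular L with positive diagonal such that a = L L^T (real data)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                pivot = a[i][i] - partial
                if pivot <= 0.0:
                    raise ValueError("matrix is not positive-definite")
                low[i][j] = math.sqrt(pivot)
            else:
                low[i][j] = (a[i][j] - partial) / low[j][j]
    return low


def times_transpose(x):
    """Return x x^T for a real square matrix x."""
    n = len(x)
    return [[sum(x[i][k] * x[j][k] for k in range(n)) for j in range(n)]
            for i in range(n)]


def rotate_first_two_columns(x, c, s):
    """Return x Theta, where Theta rotates columns 0 and 1 (c*c + s*s = 1)."""
    return [[c * r[0] + s * r[1], -s * r[0] + c * r[1]] + r[2:] for r in x]


A = [[4.0, 2.0, 2.0], [2.0, 5.0, 3.0], [2.0, 3.0, 6.0]]
L = cholesky_lower(A)                          # the Cholesky factor of A
X = rotate_first_two_columns(L, 0.6, 0.8)      # another square-root, not triangular
print(L)
print(times_transpose(X))
```

Both `L` and `X` reproduce `A` when multiplied by their transposes, but only `L` is lower triangular with a positive diagonal.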


Array Algorithms 
Now, an array algorithm generally implements transformations of the form

$$\begin{bmatrix} \times & \times & \times & \times \\ \times & \times & \times & \times \\ \times & \times & \times & \times \\ \times & \times & \times & \times \end{bmatrix} \Theta = \begin{bmatrix} \times & 0 & 0 & 0 \\ \times & \times & 0 & 0 \\ \times & \times & \times & 0 \\ \times & \times & \times & \times \end{bmatrix}$$

where $\Theta$ is some unitary matrix whose purpose is to transform the pre-array of numbers
to some triangular form. There are many ways to implement unitary transformations of
this kind. For example, it is explained in Chapter 34 that $\Theta$ could be implemented as a
sequence of elementary transformations (known as Givens rotations) or reflection
transformations (known as Householder reflections). Such array descriptions have many intrinsic
advantages:


1. They have better numerical properties for two main reasons. First, unitary transfor- 
mations are numerically well-behaved since they do not amplify numerical errors. 
And second, the entries in the pre- and post-arrays are usually square-root factors 
of certain variables and these entries tend to assume values within smaller dynamic 
ranges. 


2. They are easy to implement as a sequence of elementary rotations or reflections, as 
we explain in Chapter 34. 


3. They admit modular and parallelizable implementations, since each rotation or re- 
flection can be applied simultaneously to all rows of the pre-array. 


33.3 PRESERVATION PROPERTIES 


Since unitary transformations are at the core of most array methods, it is important to
examine some of their most distinctive properties. To begin with, a key property of unitary
transformations is that the norms of vectors and the inner products between vectors are
preserved by them. To see this, assume that $x$ and $y$ are two row vectors that are related by
some unitary transformation $\Theta$, say, $x\Theta = y$. Then

$$\|y\|^2 = yy^* = x\Theta\Theta^*x^* = xx^* = \|x\|^2$$

That is, the vectors $\{x, y\}$ will have the same Euclidean norm. We therefore say that
unitary transformations preserve Euclidean norms:

$$x\Theta = y \;\Longrightarrow\; \|x\|^2 = \|y\|^2 \quad \text{(norm preservation)} \qquad (33.7)$$


16 Accordingly, the square-root of a positive number is taken to be its positive square-root. 


In addition, if $a$ and $b$ are two other row vectors that are related by the same transformation,
say, $a\Theta = b$, then we have that

$$yb^* = x\Theta\Theta^*a^* = xa^*$$

and we find that unitary transformations also preserve inner products between vectors, i.e.,

$$x\Theta = y \ \text{and}\ a\Theta = b \;\Longrightarrow\; xa^* = yb^* \quad \text{(angle preservation)} \qquad (33.8)$$

The reason why we are referring to the inner-product preservation property as an
angle-preservation property is the following. For the case of real data we have

$$xa^T = \|x\| \cdot \|a\| \cdot \cos(\theta)$$

with $\theta$ denoting the angle between $x$ and $a$. Likewise,

$$yb^T = \|y\| \cdot \|b\| \cdot \cos(\alpha)$$

with $\alpha$ denoting the angle between $y$ and $b$. But since

$$\|a\| = \|b\| \quad \text{and} \quad \|x\| = \|y\|$$

we then conclude from $xa^T = yb^T$ that

$$\cos(\theta) = \cos(\alpha)$$

This latter equality amounts to angle preservation. We thus say that the vectors $\{x, a\}$ are
transformed into $\{y, b\}$ in such a way that their norms are preserved as well as the angles
between them (see Fig. 33.1).
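
Properties (33.7) and (33.8) can be checked numerically. The sketch below is illustrative only: the particular unitary matrix (a real rotation scaled by a unit-modulus phase), the test vectors, and the helper names are assumptions made for the example.

```python
import cmath
import math


def times(row, theta):
    """Multiply a 1 x 2 row vector by a 2 x 2 matrix."""
    return [row[0] * theta[0][0] + row[1] * theta[1][0],
            row[0] * theta[0][1] + row[1] * theta[1][1]]


def inner(x, a):
    """Inner product x a^* of two row vectors."""
    return sum(xi * ai.conjugate() for xi, ai in zip(x, a))


# An assumed unitary matrix: a rotation by 0.7 rad times the phase exp(j 0.3).
phase = cmath.exp(0.3j)
c, s = math.cos(0.7), math.sin(0.7)
theta = [[phase * c, -phase * s], [phase * s, phase * c]]

x = [1 + 2j, -0.5j]
a = [0.25 + 0j, 3 - 1j]
y = times(x, theta)
b = times(a, theta)

norm_gap = abs(inner(y, y) - inner(x, x))     # (33.7): should be ~0
inner_gap = abs(inner(y, b) - inner(x, a))    # (33.8): should be ~0
print(norm_gap, inner_gap)
```

Both gaps are at the level of round-off, even though the entries of $y$ and $b$ differ from those of $x$ and $a$.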

The conclusions (33.7)-(33.8) can be combined into a stronger (if, and only if) statement,
which will be the basis for most of our derivations of array algorithms.


Lemma 33.1 (Basis rotation) Given two $n \times M$ ($n \leq M$) matrices $A$ and $B$.
Then $AA^* = BB^*$ if, and only if, there exists an $M \times M$ unitary matrix $\Theta$
such that $A = B\Theta$.


FIGURE 33.1 Rotation of vectors $\{a, x\}$ into $\{b, y\}$ with the vector norms and the angles between
them preserved: $\theta = \alpha$, $\|a\| = \|b\|$, and $\|x\| = \|y\|$.

Proof: One direction is obvious. If $A = B\Theta$, for some unitary matrix $\Theta$, then

$$AA^* = (B\Theta)(B\Theta)^* = B(\Theta\Theta^*)B^* = BB^*$$

One proof for the converse implication follows by using the singular value decompositions of $A$ and
$B$ (cf. Sec. B.6); see Prob. VIII.2 for another proof:

$$A = U_A \begin{bmatrix} \Sigma_A & 0 \end{bmatrix} V_A^*, \qquad B = U_B \begin{bmatrix} \Sigma_B & 0 \end{bmatrix} V_B^*$$

where $U_A$ and $U_B$ are $n \times n$ unitary matrices, $V_A$ and $V_B$ are $M \times M$ unitary matrices, and $\Sigma_A$ and
$\Sigma_B$ are $n \times n$ diagonal matrices with nonnegative entries. The squares of the diagonal entries of $\Sigma_A$
($\Sigma_B$) are the eigenvalues of $AA^*$ ($BB^*$). Moreover, $U_A$ ($U_B$) are constructed from an orthonormal
basis for the right eigenvectors of $AA^*$ ($BB^*$). Hence, it follows from the identity $AA^* = BB^*$
that $\Sigma_A = \Sigma_B$ and that we can take $U_A = U_B$. Let $\Theta = V_B V_A^*$. Then $\Theta\Theta^* = I$ and $B\Theta = A$.

◇


Our derivation of array algorithms will be based on the result of Lemma 33.1. Actually, 
the result of the lemma is more than we need. It will be enough for our purposes to use 
only one direction of the lemma, namely, the statement that 


$$AA^* = BB^* \;\Longrightarrow\; \text{There exists a unitary } \Theta \text{ such that } A = B\Theta \qquad (33.9)$$


33.4 MOTIVATION FOR ARRAY METHODS 


Before plunging into the derivation of RLS array methods, we shall study a simple example
in some detail in order to highlight the main ideas underlying the mechanics of array
methods. Thus, consider two scalars $\{a, b\}$ and assume that we wish to evaluate the positive
scalar $c$ that satisfies

$$|c|^2 = |a|^2 + |b|^2 \qquad (33.10)$$

The first method that comes to mind is to evaluate the squares $|a|^2$ and $|b|^2$, add them, and
then compute the square-root of the sum to find $c$.


Preservation of Norms 
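A less obvious way for determining $c$, albeit one that will be more useful for our purposes
(especially when we deal with matrix quantities $\{A, B, C\}$ as opposed to scalars $\{a, b, c\}$),
is to use an array method. It can be motivated as follows. Observe that the right-hand side
of (33.10) is the sum of two squares and it can be expressed as an inner product:

$$|a|^2 + |b|^2 = \begin{bmatrix} a & b \end{bmatrix} \begin{bmatrix} a^* \\ b^* \end{bmatrix}$$

Similarly, the left-hand side of (33.10) can be expressed as an inner product:

$$|c|^2 = \begin{bmatrix} c & 0 \end{bmatrix} \begin{bmatrix} c^* \\ 0 \end{bmatrix}$$

In this way, relation (33.10) in effect amounts to an equality of the form

$$\begin{bmatrix} c & 0 \end{bmatrix} \begin{bmatrix} c^* \\ 0 \end{bmatrix} = \begin{bmatrix} a & b \end{bmatrix} \begin{bmatrix} a^* \\ b^* \end{bmatrix} \qquad (33.11)$$

which has the same form as $AA^* = BB^*$ with the identifications

$$A \leftarrow \begin{bmatrix} c & 0 \end{bmatrix}, \qquad B \leftarrow \begin{bmatrix} a & b \end{bmatrix} \qquad (33.12)$$

Therefore, using (33.9), we conclude that there should exist a $2 \times 2$ unitary matrix $\Theta$ that
maps $[a \;\; b]$ to $[c \;\; 0]$,

$$\begin{bmatrix} a & b \end{bmatrix} \Theta = \begin{bmatrix} c & 0 \end{bmatrix} \qquad (33.13)$$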


If we can find $\Theta$, then applying it to the pre-array $[a \;\; b]$ would result in the desired value
for $c$.

Now recall that the proof of Lemma 33.1 provides an expression for the unitary matrix
$\Theta$ that transforms $B$ to $A$. However, that expression is in terms of the right singular vectors
of $\{A, B\}$ and, therefore, it requires that we know beforehand both $A$ and $B$. Clearly, this
construction is not helpful in situations like (33.13) where $A$ is not known. For this reason,
the conclusion (33.9) is useful only in that it guarantees the existence of a $\Theta$ that performs
the required transformation (33.13).

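In order to find $\Theta$ we would argue differently as follows. Choose any unitary $\Theta$ that
transforms the pre-array, $[a \;\; b]$, to the generic form

$$\begin{bmatrix} a & b \end{bmatrix} \Theta = \begin{bmatrix} x & 0 \end{bmatrix} \qquad (33.14)$$

That is, choose any unitary $\Theta$ that annihilates the second entry of $[a \;\; b]$ and let $x$ denote
the resulting leading entry of the post-array. We explain in Lemma 34.2, and in the remark
following it, how such a $\Theta$ can be found for any $[a \;\; b]$, e.g., a Givens rotation could be
used, which in this case would be given by

$$\Theta = \begin{cases} \dfrac{e^{-j\phi_a}}{\sqrt{1 + |\rho|^2}} \begin{bmatrix} 1 & -\rho \\ \rho^* & 1 \end{bmatrix}, & \text{if } a \neq 0 \text{ and where } \rho = b/a \\[2ex] e^{-j\phi_b} \begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix}, & \text{if } a = 0 \end{cases}$$

where $\{\phi_a, \phi_b\}$ are the phases of (the possibly complex numbers) $\{a, b\}$. If we apply this
choice of $\Theta$ to the pre-array $[a \;\; b]$ in (33.14), then it is easy to see by direct calculation
that a post-array of the form $[x \;\; 0]$ will result. Specifically, as explained in the remark
following Lemma 34.2, we get

$$\begin{bmatrix} x & 0 \end{bmatrix} = \begin{bmatrix} \sqrt{|a|^2 + |b|^2} & 0 \end{bmatrix}$$

which readily identifies $x$ as the desired $c$.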

Alternatively, the value of $x$ could have been identified by using the property (33.7)
and without assuming any explicit knowledge of $\Theta$. Indeed, (33.7) states that unitary
transformations preserve Euclidean norms, so that the norm of the pre-array $[a \;\; b]$ must
coincide with the norm of the post-array $[x \;\; 0]$. Therefore, by "squaring" both sides of
(33.14), namely, by writing

$$\begin{bmatrix} a & b \end{bmatrix} \underbrace{\Theta\Theta^*}_{=I} \begin{bmatrix} a^* \\ b^* \end{bmatrix} = \begin{bmatrix} x & 0 \end{bmatrix} \begin{bmatrix} x^* \\ 0 \end{bmatrix}$$

we get

$$|a|^2 + |b|^2 = |x|^2$$

so that, for a positive $x$, we have $x = c$.

In summary, this discussion shows that whenever we encounter an equality of the form 
(33.10), where both sides can be interpreted as sums of squares and can therefore be ex- 
pressed as inner products of certain vectors with themselves, then we can compute the 
left-hand side entry in array form as follows. Form a pre-array and annihilate its second 
entry by means of any unitary rotation that results in a positive leading entry in the post- 
array, as in (33.14). Then this leading entry should be the desired c. 
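
The summary above can be sketched in a few lines. The code assumes that the Givens rotation takes the form $\Theta = \frac{e^{-j\phi_a}}{\sqrt{1+|\rho|^2}}\begin{bmatrix} 1 & -\rho \\ \rho^* & 1 \end{bmatrix}$ with $\rho = b/a$ when $a \neq 0$, and a phase-scaled permutation when $a = 0$; the function names and test values are hypothetical.

```python
import cmath
import math


def annihilating_rotation(a, b):
    """Unitary Theta such that [a b] Theta = [sqrt(|a|^2 + |b|^2), 0]."""
    a, b = complex(a), complex(b)
    if a == 0:
        phase = cmath.exp(-1j * cmath.phase(b))
        return [[0.0, -phase], [phase, 0.0]]
    rho = b / a
    scale = cmath.exp(-1j * cmath.phase(a)) / math.sqrt(1.0 + abs(rho) ** 2)
    return [[scale, -scale * rho], [scale * rho.conjugate(), scale]]


def times(row, theta):
    """Multiply a 1 x 2 row vector by a 2 x 2 matrix."""
    return [row[0] * theta[0][0] + row[1] * theta[1][0],
            row[0] * theta[0][1] + row[1] * theta[1][1]]


a, b = 3j, -4.0
x, residual = times([a, b], annihilating_rotation(a, b))   # post-array [x 0]
c_direct = math.sqrt(abs(a) ** 2 + abs(b) ** 2)            # square, add, take root

x0, residual0 = times([0.0, 2j], annihilating_rotation(0.0, 2j))  # case a = 0
print(x, residual, c_direct)
```

In both cases the leading entry of the post-array agrees with the value obtained by squaring, adding, and taking the square-root.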


Preservation of Inner Products 

In order to further appreciate the convenience of the array formulation, assume now that in
addition to the scalars $\{a, b, c\}$ satisfying (33.10), we are also given scalars $\{d, e\}$ and that
we want to evaluate the scalar $f$ that satisfies

$$fc^* = da^* + eb^* \qquad (33.15)$$

Of course, given the $\{a, b\}$ we could first determine $c$ as explained above, and then evaluate
$f$ by dividing the right-hand side of (33.15) by $c^*$ (or $c$ since $c$ is real in this example).
Alternatively, we could evaluate $f$ in array form, just like we did for $c$, as follows.

We start by noting that the right-hand side of (33.15) can be interpreted as the inner
product between two vectors:

$$da^* + eb^* = \begin{bmatrix} d & e \end{bmatrix} \begin{bmatrix} a^* \\ b^* \end{bmatrix}$$


Now we explained above how to find a unitary transformation $\Theta$ that takes $[a \;\; b]$ to $[c \;\; 0]$. In
addition, we know from our discussions in Sec. 33.3, that any such unitary transformation
preserves not only vector norms but also inner products between vectors (recall (33.8)).
This second property can be used to our advantage here. Assume we apply $\Theta$ to both $[a \;\; b]$
and $[d \;\; e]$. We already know that in the first case we obtain $[c \;\; 0]$ as the post-array, whereas
in the second case we would obtain some other post-array that we denote by $[y \;\; z]$:

$$\begin{bmatrix} a & b \end{bmatrix} \Theta = \begin{bmatrix} c & 0 \end{bmatrix}, \qquad \begin{bmatrix} d & e \end{bmatrix} \Theta = \begin{bmatrix} y & z \end{bmatrix}$$

The preservation of inner products then implies that the inner-product of the pre-array
vectors should coincide with the inner-product of the post-array vectors, i.e.,

$$\begin{bmatrix} d & e \end{bmatrix} \begin{bmatrix} a^* \\ b^* \end{bmatrix} = \begin{bmatrix} y & z \end{bmatrix} \begin{bmatrix} c^* \\ 0 \end{bmatrix}$$

or, equivalently,

$$da^* + eb^* = yc^*$$

Comparing with (33.15), we see that we can immediately identify $y$ as the desired $f$. As
for $z$, while it is not of immediate interest to us here, its value can be identified by noting
that $[d \;\; e]$ and $[y \;\; z]$ must have identical Euclidean norms, so that

$$|z|^2 + |f|^2 = |d|^2 + |e|^2$$


Array Description 

In conclusion, the above discussion shows that calculations of the type (33.10) and (33.15),
aimed at determining $\{c, f\}$ from knowledge of $\{a, b, d, e\}$, can be accomplished in array
form as follows. We form the pre-array

$$\mathcal{A} = \begin{bmatrix} a & b \\ d & e \end{bmatrix} \qquad (33.16)$$

and choose a unitary matrix $\Theta$ that lower triangularizes $\mathcal{A}$, namely, it reduces $\mathcal{A}$ to the form

$$\mathcal{A}\Theta = \begin{bmatrix} x & 0 \\ y & z \end{bmatrix} \qquad (33.17)$$

with a positive $x$. The determination of $\Theta$ is solely dependent on the first row of the
pre-array $\mathcal{A}$; the entries of the second row of $\mathcal{A}$ are not used to define $\Theta$.

Then the entries $\{x, y\}$ can be identified as the desired $\{c, f\}$; this identification follows
from the preservation of norms and inner products by unitary matrices. In particular, the
identification of $x$ as $c$ follows from the fact that the top rows of the pre- and post-arrays
$\{\mathcal{A}, \mathcal{A}\Theta\}$ must have the same Euclidean norms, while the identification of $y$ as $f$ follows
from the fact that the inner product of the rows in the pre- and post-arrays must coincide.

Another way of carrying out this procedure for identifying the entries of the post-array
$\mathcal{A}\Theta$ is as follows. Given the pre-array $\mathcal{A}$ as in (33.16), then the entries of the post-array
$\mathcal{A}\Theta$ in (33.17) can be identified by "squaring" both sides of (33.17), i.e., by writing

$$\mathcal{A}\underbrace{\Theta\Theta^*}_{=I}\mathcal{A}^* = \mathcal{A}\mathcal{A}^* = \begin{bmatrix} x & 0 \\ y & z \end{bmatrix} \begin{bmatrix} x & 0 \\ y & z \end{bmatrix}^*$$

and then by comparing terms on both sides of the resulting equality:

$$\begin{bmatrix} a & b \\ d & e \end{bmatrix} \begin{bmatrix} a & b \\ d & e \end{bmatrix}^* = \begin{bmatrix} x & 0 \\ y & z \end{bmatrix} \begin{bmatrix} x & 0 \\ y & z \end{bmatrix}^*$$

Doing so results in the relations

$$|x|^2 = |a|^2 + |b|^2 \quad \text{and} \quad yx^* = da^* + eb^*$$

which identify $x$ as $c$ and $y$ as $f$.
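
For real data the whole procedure fits in a few lines. This is a minimal sketch under stated assumptions: the rotation is written with cosine and sine parameters $a/r$ and $b/r$, where $r = \sqrt{a^2 + b^2}$, so that the leading entry is always positive, and the function name and numbers are illustrative only.

```python
import math


def lower_triangularize(pre_array):
    """Apply a rotation fixed by the first row of a real 2 x 2 pre-array."""
    (a, b), _second_row = pre_array
    r = math.hypot(a, b)
    if r == 0.0:
        raise ValueError("first row of the pre-array is zero")
    c, s = a / r, b / r
    theta = [[c, -s], [s, c]]                 # orthogonal; depends on (a, b) only
    return [[row[0] * theta[0][0] + row[1] * theta[1][0],
             row[0] * theta[0][1] + row[1] * theta[1][1]] for row in pre_array]


a, b, d, e = 3.0, 4.0, 1.0, 2.0
(x, zero), (y, z) = lower_triangularize([[a, b], [d, e]])

c_direct = math.sqrt(a * a + b * b)           # direct evaluation of (33.10)
f_direct = (d * a + e * b) / c_direct         # direct evaluation of (33.15)
print(x, y, z)
```

The post-array entries `x` and `y` agree with `c_direct` and `f_direct`, and `z` satisfies $z^2 + f^2 = d^2 + e^2$ without ever dividing by $c$.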
Vector Case 
The above example, with scalar entries $\{a, b, c, d, e, f\}$, illustrates the main ideas behind
the derivation of array algorithms. In the context of adaptive filtering, however, we shall
encounter vector analogues of relations (33.10) and (33.15), such as determining a lower
triangular matrix $C$, with positive diagonal entries, satisfying

$$CC^* = AA^* + BB^* \qquad (33.19)$$

and determining a matrix $F$ satisfying

$$FC^* = DA^* + EB^* \qquad (33.20)$$

where $\{A, B, D, E\}$ are generally matrix or vector quantities. The same arguments that
we used above will reveal that $\{C, F\}$ can be determined by means of an array method as
follows. We form the pre-array (cf. (33.16)):

$$\mathcal{A} = \begin{bmatrix} A & B \\ D & E \end{bmatrix}$$

and reduce it via a unitary transformation $\Theta$ to the lower triangular form (cf. (33.17)):

$$\begin{bmatrix} X & 0 \\ Y & Z \end{bmatrix}$$

where $X$ is lower triangular with positive entries along its diagonal; sometimes $Z$ is a square
matrix and $\Theta$ is also required to generate it in lower-triangular form along with $X$:

$$\begin{bmatrix} A & B \\ D & E \end{bmatrix} \Theta = \begin{bmatrix} X & 0 \\ Y & Z \end{bmatrix} \qquad (33.21)$$


The matrix $\Theta$ is not $2 \times 2$ any longer; but it can be implemented as a sequence of
elementary (Givens) rotations or Householder reflections, as explained in Chapter 34, where we
show how to lower triangularize a matrix via a sequence of rotations or reflections. The
reader is encouraged to consult Chapter 34 at this stage in order to learn how such unitary
transformations $\Theta$ are implemented. The subsequent presentation in this chapter assumes
that the reader is familiar with the material of Chapter 34.

An explicit expression for $\Theta$ in (33.21) is not needed. All we need to do is find the right
sequence of rotations that yields the desired triangular post-array. Then, by "squaring"
both sides of (33.21) we get

$$\begin{bmatrix} A & B \\ D & E \end{bmatrix} \underbrace{\Theta\Theta^*}_{=I} \begin{bmatrix} A & B \\ D & E \end{bmatrix}^* = \begin{bmatrix} X & 0 \\ Y & Z \end{bmatrix} \begin{bmatrix} X & 0 \\ Y & Z \end{bmatrix}^*$$

so that we must have

$$XX^* = AA^* + BB^* \quad \text{and} \quad YX^* = DA^* + EB^*$$

In this way, $X$ can be identified as the lower triangular Cholesky factor of the matrix
$AA^* + BB^*$, and since Cholesky factors are unique, we conclude that $X$ must coincide
with the desired $C$. From the second equality above we conclude that $Y = F$, so that
the array algorithm (33.21) enables us to determine $\{C, F\}$. We may remark that we are
not restricted to array methods with two (block) rows in the pre-array and post-arrays as
in (33.21). If additional relations are available that satisfy certain norm and inner-product
preservation properties, then these could be incorporated into the array algorithm as well.
A demonstration to this effect is the QR algorithm of Sec. 35.2.
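
A minimal sketch of the matrix case for real data follows. It assumes that $A$ is $n \times n$, that $B$ has at least one column, and that the top block row is triangularized by a sequence of column rotations, each chosen to make its pivot nonnegative; the block $Z$ is left unstructured. The function names and the test matrices are hypothetical.

```python
import math


def rotate_columns(array, pivot_row, i, j):
    """Rotate columns i and j of every row so that array[pivot_row][j] becomes 0."""
    a, b = array[pivot_row][i], array[pivot_row][j]
    r = math.hypot(a, b)
    if r == 0.0:
        return                                   # nothing to annihilate
    c, s = a / r, b / r
    for row in array:
        u, v = row[i], row[j]
        row[i] = c * u + s * v
        row[j] = -s * u + c * v


def array_method(A, B, D, E):
    """Return (X, Y) from [A B; D E] Theta = [X 0; Y Z], for real data."""
    n = len(A)
    post = [ra + rb for ra, rb in zip(A, B)] + [rd + r_e for rd, r_e in zip(D, E)]
    width = len(post[0])
    for i in range(n):                           # triangularize the top block row
        for j in range(i + 1, width):
            rotate_columns(post, i, i, j)
    return [row[:n] for row in post[:n]], [row[:n] for row in post[n:]]


def times_transpose(P, Q):
    """Return P Q^T."""
    return [[sum(p * q for p, q in zip(prow, qrow)) for qrow in Q] for prow in P]


def add(P, Q):
    return [[p + q for p, q in zip(prow, qrow)] for prow, qrow in zip(P, Q)]


A = [[2.0, 0.0], [1.0, 1.0]]
B = [[1.0], [2.0]]
D = [[1.0, 0.0]]
E = [[3.0]]
X, Y = array_method(A, B, D, E)
gram = add(times_transpose(A, A), times_transpose(B, B))     # AA^T + BB^T
cross = add(times_transpose(D, A), times_transpose(E, B))    # DA^T + EB^T
print(X, Y)
```

Here `X` plays the role of $C$ and `Y` the role of $F$; no explicit expression for $\Theta$ is ever formed, only the sequence of rotations.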


CHAPTER 34


Unitary Transformations


In this chapter we describe two classes of unitary transformations (Givens and Householder)
that can be used to annihilate selected entries in a vector or matrix and thereby
reduce a matrix to triangular (or similar) form, as is often required by array algorithms.
Special care needs to be taken when dealing with complex-valued data as compared to
real-valued data.


34.1 GIVENS ROTATIONS


Givens rotations provide an effective way to annihilate specific entries in a vector and it is 
enough to explain their operation on 2-dimensional row vectors. We consider the case of 
real-valued data first. 


Real data 
Consider a $1 \times 2$ real-valued vector $z = [a \;\; b]$, and assume that we wish to determine
a $2 \times 2$ matrix $\Theta$ that transforms it to the form:

$$\begin{bmatrix} a & b \end{bmatrix} \Theta = \begin{bmatrix} \alpha & 0 \end{bmatrix} \qquad (34.1)$$

for some real number $\alpha$ to be determined, and where $\Theta$ is required to be orthogonal, i.e., it
should satisfy

$$\Theta\Theta^T = \Theta^T\Theta = I$$

We refer to $[a \;\; b]$ as the pre-array and to $[\alpha \;\; 0]$ as the post-array.

Now any orthogonal matrix $\Theta$ has the important property that it preserves vector norms.
Indeed, it is easy to see from (34.1) that the following equality must hold:

$$\begin{bmatrix} a & b \end{bmatrix} \underbrace{\Theta\Theta^T}_{=I} \begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} \alpha & 0 \end{bmatrix} \begin{bmatrix} \alpha \\ 0 \end{bmatrix}$$

or, equivalently, $a^2 + b^2 = \alpha^2$. In this way, no matter which orthogonal transformation $\Theta$
we choose to implement the transformation (34.1), it will always hold that the Euclidean
norm of the post-array should coincide with the Euclidean norm of the pre-array. This fact
allows us to conclude the value of $\alpha$, namely,

$$\alpha = \pm\sqrt{a^2 + b^2}$$

even before knowing the expression of any $\Theta$ that achieves (34.1). Note that there are two
choices for $\alpha$ and which one we pick depends on how $\Theta$ is implemented, as explained next.

An expression for an orthogonal $\Theta$ that achieves the transformation (34.1) is given by

$$\Theta = \frac{1}{\sqrt{1 + \rho^2}} \begin{bmatrix} 1 & -\rho \\ \rho & 1 \end{bmatrix} \quad \text{where} \quad \rho = \frac{b}{a}, \;\; a \neq 0 \qquad (34.2)$$

This choice is known as a Givens or circular rotation. Surely, it can be verified by direct
calculation that this $\Theta$ is orthogonal and that it leads to

$$\begin{bmatrix} a & b \end{bmatrix} \Theta = \begin{bmatrix} \pm\sqrt{a^2 + b^2} & 0 \end{bmatrix}$$

The choice of which sign to pick depends on whether the value of the square root in the
expression (34.2) for $\Theta$ is chosen to be negative or positive. In general, we choose the
positive sign.

The reason for the denomination circular rotation for $\Theta$ can be seen by expressing $\Theta$ in
the form

$$\Theta = \begin{bmatrix} c & -s \\ s & c \end{bmatrix}, \qquad c \triangleq \frac{1}{\sqrt{1 + \rho^2}}, \quad s \triangleq \frac{\rho}{\sqrt{1 + \rho^2}}$$

in terms of cosine and sine parameters. In this way, we find that the effect of $\Theta$ is to rotate
any point $(x, y)$ in the two-dimensional Euclidean space along the circle of equation

$$x^2 + y^2 = a^2 + b^2$$

When $a = 0$, we simply select $\Theta$ to be the permutation matrix

$$\Theta = \begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix} \qquad (34.3)$$

We could also choose

$$\Theta = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$$

with $+1$ instead of $-1$. The effect of this permutation will be the same; we choose (34.3)
with a minus sign so that (34.3) can be regarded as the limit of the Givens rotation (34.2)
when $\rho \to \infty$. In summary, we have the following result.


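Lemma 34.1 (Real Givens rotation) Consider a $1 \times 2$ vector $[a \;\; b]$ with real
entries. Then choose $\Theta$ as in (34.2) to get

$$\begin{bmatrix} a & b \end{bmatrix} \Theta = \pm\sqrt{a^2 + b^2} \begin{bmatrix} 1 & 0 \end{bmatrix}$$

If $a = 0$, then choose $\Theta$ as in (34.3) to get $[0 \;\; b]\Theta = [b \;\; 0]$.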
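
The lemma translates directly into code. The sketch below uses the positive square root in (34.2), so that the sign of the leading post-array entry follows the sign of $a$; the function names and test vectors are assumptions made for illustration.

```python
import math


def real_givens(a, b):
    """Orthogonal Theta of (34.2), or the permutation (34.3) when a == 0."""
    if a == 0:
        return [[0.0, -1.0], [1.0, 0.0]]
    rho = b / a
    scale = 1.0 / math.sqrt(1.0 + rho * rho)     # positive square root
    return [[scale, -scale * rho], [scale * rho, scale]]


def times(row, theta):
    """Multiply a 1 x 2 row vector by a 2 x 2 matrix."""
    return [row[0] * theta[0][0] + row[1] * theta[1][0],
            row[0] * theta[0][1] + row[1] * theta[1][1]]


post_positive = times([3.0, 4.0], real_givens(3.0, 4.0))     # close to [ 5, 0]
post_negative = times([-3.0, 4.0], real_givens(-3.0, 4.0))   # close to [-5, 0]
post_zero = times([0.0, 2.0], real_givens(0.0, 2.0))         # close to [ 2, 0]
print(post_positive, post_negative, post_zero)
```

A negative pivot $a$ therefore yields a negative leading entry unless the rotation is replaced by its negative.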


Rotations versus Reflections 


We can also use elementary reflections as opposed to rotations. These are defined by
matrices of the form

$$\Theta = \frac{1}{\sqrt{1 + \rho^2}} \begin{bmatrix} 1 & \rho \\ \rho & -1 \end{bmatrix} \qquad (34.4)$$

Observe that the determinant of a rotation matrix is equal to $+1$ while the determinant of a
reflection matrix is equal to $-1$.

The reason for the denomination "reflection" is the following. Let $z = \|z\|e^{j\theta}$ denote
the polar representation of a point $z = [a \;\; b]$ in the two-dimensional Euclidean space. Let
$\rho = b/a$ and consider the matrix $\Theta$ defined by (34.4). The effect of this matrix on $z$ is to
align it with $[1 \;\; 0]$. The manner by which this alignment is achieved is by reflecting $z$ across
the line that passes through the origin and the point $(\cos(\theta/2), \sin(\theta/2))$ (see Fig. 34.1).


FIGURE 34.1 Aligning a vector $z$ with the first basis vector by means of rotation (left) or
reflection across the line passing through $(\cos(\theta/2), \sin(\theta/2))$ (right).


The distinction between rotations and reflections becomes more obvious by examining
their effect on other vectors (i.e., other than the vector $z$ that determined them). So let, for
example,

$$z = \begin{bmatrix} \sqrt{2} & \sqrt{2} \end{bmatrix}$$

so that $\rho = 1$, $\|z\| = 2$, and $\theta = \pi/4$. The corresponding rotation and reflection matrices
are given by

$$\Theta^{\text{rot}} = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 & -1 \\ 1 & 1 \end{bmatrix}, \qquad \Theta^{\text{ref}} = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}$$

Applying any of these matrices to $z$ will align it with the basis vector $[1 \;\; 0]$, as shown in
Fig. 34.1. Consider now the vector

$$z' = \begin{bmatrix} 0 & 2 \end{bmatrix}$$

which lies on the same circle as $z$. Multiplying $z'$ by $\Theta^{\text{rot}}$ results in

$$z'\Theta^{\text{rot}} = \begin{bmatrix} \sqrt{2} & \sqrt{2} \end{bmatrix}$$

so that $z'$ is rotated by $\pi/4$ radians in the clockwise direction until it reaches its destination
at $[\sqrt{2} \;\; \sqrt{2}]$ (see Fig. 34.2). On the other hand, multiplying $z'$ by $\Theta^{\text{ref}}$ results in

$$z'\Theta^{\text{ref}} = \begin{bmatrix} \sqrt{2} & -\sqrt{2} \end{bmatrix}$$

so that, in this case, $z'$ is reflected along the line passing through the origin and the point
$(\cos(\pi/8), \sin(\pi/8))$ in order to attain its destination at $[\sqrt{2} \;\; -\sqrt{2}]$ (see again Fig. 34.2).

FIGURE 34.2 Rotating the vector $[0 \;\; 2]$ into $[\sqrt{2} \;\; \sqrt{2}]$ (left) vs. reflecting it into $[\sqrt{2} \;\; -\sqrt{2}]$ across
the line passing through the origin at an angle of $\pi/8$ radians (right).
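
The numbers in this comparison are easy to confirm. The sketch below assumes that the rotation is $\frac{1}{\sqrt{1+\rho^2}}\begin{bmatrix} 1 & -\rho \\ \rho & 1 \end{bmatrix}$ and the reflection is $\frac{1}{\sqrt{1+\rho^2}}\begin{bmatrix} 1 & \rho \\ \rho & -1 \end{bmatrix}$; the function names are hypothetical.

```python
import math


def rotation(rho):
    s = 1.0 / math.sqrt(1.0 + rho * rho)
    return [[s, -s * rho], [s * rho, s]]


def reflection(rho):
    s = 1.0 / math.sqrt(1.0 + rho * rho)
    return [[s, s * rho], [s * rho, -s]]


def times(row, theta):
    """Multiply a 1 x 2 row vector by a 2 x 2 matrix."""
    return [row[0] * theta[0][0] + row[1] * theta[1][0],
            row[0] * theta[0][1] + row[1] * theta[1][1]]


def det(theta):
    return theta[0][0] * theta[1][1] - theta[0][1] * theta[1][0]


z = [math.sqrt(2.0), math.sqrt(2.0)]
rho = z[1] / z[0]                                  # equals 1
rot, ref = rotation(rho), reflection(rho)

z_rot, z_ref = times(z, rot), times(z, ref)        # both close to [2, 0]
zp_rot = times([0.0, 2.0], rot)                    # close to [sqrt(2),  sqrt(2)]
zp_ref = times([0.0, 2.0], ref)                    # close to [sqrt(2), -sqrt(2)]
print(det(rot), det(ref), zp_rot, zp_ref)
```

The two matrices agree on $z$ but send $z'$ to different points, and their determinants are $+1$ and $-1$, respectively.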


Complex data 
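When the entries of $z = [a \;\; b]$ are complex-valued, we should seek a unitary (as
opposed to orthogonal) matrix $\Theta$ that achieves the transformation (34.1), namely, $\Theta$ should
now satisfy

$$\Theta\Theta^* = \Theta^*\Theta = I$$

An expression for such a $\Theta$ is

$$\Theta = \frac{1}{\sqrt{1 + |\rho|^2}} \begin{bmatrix} 1 & -\rho \\ \rho^* & 1 \end{bmatrix} \quad \text{where} \quad \rho = \frac{b}{a}, \;\; a \neq 0 \qquad (34.5)$$

Let $a = |a|e^{j\phi_a}$ denote the polar representation of $a$. Then it can be verified that we now
obtain

$$\begin{bmatrix} a & b \end{bmatrix} \Theta = \begin{bmatrix} \pm e^{j\phi_a}\sqrt{|a|^2 + |b|^2} & 0 \end{bmatrix} \qquad (34.6)$$

In other words, the value of $\alpha$ is in general complex and its phase is determined by the
phase of $a$. Again, the choice of the sign depends on whether the value of the square root
in (34.5) is chosen to be negative or positive. When $a = 0$, we choose $\Theta$ as the permutation
matrix (34.3).


Lemma 34.2 (Complex Givens rotation) Consider a $1 \times 2$ vector $[a \;\; b]$ with
possibly complex entries. Then choose $\Theta$ as in (34.5) to get

$$\begin{bmatrix} a & b \end{bmatrix} \Theta = \pm e^{j\phi_a}\sqrt{|a|^2 + |b|^2} \begin{bmatrix} 1 & 0 \end{bmatrix}$$

where $\phi_a$ denotes the phase of $a$. If $a = 0$, then choose $\Theta$ as in (34.3) to get
$[0 \;\; b]\Theta = [b \;\; 0]$.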


Remark 34.1 (Real post-array) Usually, it is desirable to obtain a real-valued (and also positive) $\alpha$ even
in the complex case (34.6). This property can be enforced by choosing $\Theta$ as

$$\Theta = \frac{e^{-j\phi_a}}{\sqrt{1 + |\rho|^2}} \begin{bmatrix} 1 & -\rho \\ \rho^* & 1 \end{bmatrix} \quad \text{where} \quad \rho = \frac{b}{a}, \;\; a \neq 0$$

To enforce $\alpha > 0$, we simply choose the plus sign in (34.6). If $a$ already happens to be real-valued, then of
course $\phi_a = 0$ and the same $\Theta$ as before in (34.5) results in a real $\alpha$. Note further that, in this case, the diagonal
entries of $\Theta$ will be real.

When $a = 0$, we choose $\Theta$ as

$$\Theta = e^{-j\phi_b} \begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix}$$

in terms of the phase of $b$. This remark applies to all our future discussions for complex-valued data; so it will
not be repeated again.

◇
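
A short numerical check of the complex case follows. It assumes that the unitary rotation of (34.5) has the form $\frac{1}{\sqrt{1+|\rho|^2}}\begin{bmatrix} 1 & -\rho \\ \rho^* & 1 \end{bmatrix}$ with $\rho = b/a$ and $a \neq 0$; the helper names and the test values are illustrative assumptions.

```python
import cmath
import math


def complex_givens(a, b):
    """Assumed form of (34.5); requires a != 0."""
    rho = complex(b) / complex(a)
    scale = 1.0 / math.sqrt(1.0 + abs(rho) ** 2)
    return [[scale, -scale * rho], [scale * rho.conjugate(), scale]]


def times(row, theta):
    """Multiply a 1 x 2 row vector by a 2 x 2 matrix."""
    return [row[0] * theta[0][0] + row[1] * theta[1][0],
            row[0] * theta[0][1] + row[1] * theta[1][1]]


def unitarity_gap(theta):
    """Largest entry of |Theta Theta^* - I|."""
    gap = 0.0
    for i in range(2):
        for j in range(2):
            entry = sum(complex(theta[i][k]) * complex(theta[j][k]).conjugate()
                        for k in range(2))
            gap = max(gap, abs(entry - (1.0 if i == j else 0.0)))
    return gap


a, b = 1 + 1j, 2 - 1j
theta = complex_givens(a, b)
alpha, residual = times([a, b], theta)

magnitude = math.sqrt(abs(a) ** 2 + abs(b) ** 2)
expected = cmath.exp(1j * cmath.phase(a)) * magnitude    # right-hand side of (34.6)
real_alpha = alpha * cmath.exp(-1j * cmath.phase(a))     # effect of the extra phase factor
print(alpha, real_alpha)
```

The leading entry `alpha` carries the phase of $a$; multiplying the rotation by $e^{-j\phi_a}$, as in the remark, turns it into the real positive value $\sqrt{|a|^2 + |b|^2}$.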

Example 34.1 (Using Givens rotations) 

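Assume we are given a $2 \times 3$ pre-array $A$,

$$A = \begin{bmatrix} 1 & 0.75 & 0.25 \\ 0.4 & 0.2 & 0.2 \end{bmatrix} \qquad (34.7)$$

and that we wish to reduce it to the form

$$A\Theta = \begin{bmatrix} \times & 0 & 0 \\ \times & \times & 0 \end{bmatrix} \qquad (34.8)$$

via a sequence of Givens rotations. This can be obtained, among several possibilities, as follows. We
first annihilate the (1, 3) entry of $A$ by pivoting with its (1, 1) entry. From the construction (34.2),
we know that the orthogonal matrix $\Theta_1$ that achieves this transformation is given by

$$\Theta_1 = \frac{1}{\sqrt{1 + \rho_1^2}} \begin{bmatrix} 1 & -\rho_1 \\ \rho_1 & 1 \end{bmatrix} = \begin{bmatrix} 0.9701 & -0.2425 \\ 0.2425 & 0.9701 \end{bmatrix}, \qquad \rho_1 = \frac{0.25}{1}$$

Applying $\Theta_1$ to $A$, and leaving the second column of $A$ unchanged, leads to

$$\underbrace{\begin{bmatrix} 1 & 0.75 & 0.25 \\ 0.4 & 0.2 & 0.2 \end{bmatrix}}_{A} \begin{bmatrix} 0.9701 & 0 & -0.2425 \\ 0 & 1 & 0 \\ 0.2425 & 0 & 0.9701 \end{bmatrix} = \underbrace{\begin{bmatrix} 1.0307 & 0.7500 & 0 \\ 0.4365 & 0.2000 & 0.0970 \end{bmatrix}}_{A_1}$$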


We now annihilate the (1, 2) entry of the post-array $A_1$ by pivoting with its (1, 1) entry. For this
purpose, we choose the orthogonal matrix as

$$\Theta_2 = \frac{1}{\sqrt{1 + \rho_2^2}} \begin{bmatrix} 1 & -\rho_2 \\ \rho_2 & 1 \end{bmatrix} = \begin{bmatrix} 0.8086 & -0.5884 \\ 0.5884 & 0.8086 \end{bmatrix}, \qquad \rho_2 = \frac{0.7500}{1.0307}$$

Applying $\Theta_2$ to $A_1$, and leaving the third column of $A_1$ unchanged, leads to

$$\underbrace{\begin{bmatrix} 1.0307 & 0.7500 & 0 \\ 0.4365 & 0.2000 & 0.0970 \end{bmatrix}}_{A_1} \begin{bmatrix} 0.8086 & -0.5884 & 0 \\ 0.5884 & 0.8086 & 0 \\ 0 & 0 & 1 \end{bmatrix} = \underbrace{\begin{bmatrix} 1.2748 & 0 & 0 \\ 0.4707 & -0.0951 & 0.0970 \end{bmatrix}}_{A_2}$$

We finally annihilate the (2, 3) entry of $A_2$ by pivoting with its (2, 2) entry. One way to achieve this
transformation is to use the orthogonal matrix

$$\Theta_3 = \frac{1}{\sqrt{1 + \rho_3^2}} \begin{bmatrix} 1 & -\rho_3 \\ \rho_3 & 1 \end{bmatrix} = \begin{bmatrix} 0.7001 & 0.7140 \\ -0.7140 & 0.7001 \end{bmatrix}, \qquad \rho_3 = \frac{0.0970}{-0.0951}$$

and to apply it to $A_2$, without modifying its first column, thus leading to

$$\underbrace{\begin{bmatrix} 1.2748 & 0 & 0 \\ 0.4707 & -0.0951 & 0.0970 \end{bmatrix}}_{A_2} \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0.7001 & 0.7140 \\ 0 & -0.7140 & 0.7001 \end{bmatrix} = \underbrace{\begin{bmatrix} 1.2748 & 0 & 0 \\ 0.4707 & -0.1359 & 0 \end{bmatrix}}_{A_3}$$

Clearly, the negative entry $-0.1359$ could have been replaced by a positive entry $0.1359$ had we
employed $-\Theta_3$ instead of $\Theta_3$.

A more direct way to achieve the last step is to avoid forming $\Theta_3$ altogether and to simply replace
the vector $[-0.0951 \;\; 0.0970]$ by $[\alpha \;\; 0]$, where $\alpha$ is the norm of the vector, i.e., by $[0.1359 \;\; 0]$. In
this way, the resulting post-array becomes

$$\begin{bmatrix} 1.2748 & 0 & 0 \\ 0.4707 & 0.1359 & 0 \end{bmatrix}$$

We have therefore determined a succession of three elementary orthogonal transformations that
triangularize the original pre-array $A$. The combined effect of these transformations is to achieve the
desired transformation (34.8).

◇
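
The three rotations of this example can be reproduced as follows. The sketch applies each rotation of the form (34.2), with the positive square root, directly to the two affected columns rather than building the $3 \times 3$ embedding; the function name and the 0-based indices are choices made for the example.

```python
import math


def givens_on_columns(array, row, pivot, target):
    """Zero array[row][target] by rotating columns pivot and target, as in (34.2)."""
    rho = array[row][target] / array[row][pivot]   # assumes a nonzero pivot
    scale = 1.0 / math.sqrt(1.0 + rho * rho)
    for r in array:
        u, v = r[pivot], r[target]
        r[pivot] = scale * (u + rho * v)
        r[target] = scale * (v - rho * u)


A = [[1.0, 0.75, 0.25],
     [0.4, 0.2, 0.2]]

givens_on_columns(A, row=0, pivot=0, target=2)   # annihilate the (1, 3) entry
givens_on_columns(A, row=0, pivot=0, target=1)   # annihilate the (1, 2) entry
givens_on_columns(A, row=1, pivot=1, target=2)   # annihilate the (2, 3) entry

for r in A:
    print([round(v, 4) for v in r])
```

Because the last pivot is negative, the (2, 2) entry of the result comes out negative, in agreement with the remark about using $-\Theta_3$.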


34.2 HOUSEHOLDER TRANSFORMATIONS 


In contrast to Givens rotations, Householder transformations can be used to annihilate mul- 
tiple entries in a row vector at once. We describe them below for both cases of real and 
complex data. 


Real data 
Let $e_0 = [1 \;\; 0 \;\; \ldots \;\; 0]$ denote the leading row basis vector in $n$-dimensional
Euclidean space, and consider a $1 \times n$ real-valued vector $z$ with entries $\{z(i), i = 0, 1, \ldots, n-1\}$.
Assume that we wish to transform $z$ to the form

$$\begin{bmatrix} z(0) & z(1) & \ldots & z(n-1) \end{bmatrix} \Theta = \alpha e_0 \qquad (34.9)$$

for some real scalar $\alpha$ to be determined, and where the transformation $\Theta$ is required to
be both orthogonal and involutory. That is, $\Theta$ should satisfy $\Theta\Theta^T = I$ and $\Theta^2 = I$ (or,
equivalently, $\Theta = \Theta^T$ and $\Theta^2 = I$). We remark in passing that matrices $Q$ that satisfy
$Q^2 = I$ are called involutory matrices.

Of course, the scalar $\alpha$ cannot be arbitrary and its value can be determined even before
determining the expression for a matrix $\Theta$ that achieves (34.9). Indeed, note from (34.9)
and from the orthogonality of $\Theta$ that

$$z\Theta\Theta^T z^T = \|z\|^2 = \alpha^2$$

so that we must have $\alpha = \pm\|z\|$. Both values of $\alpha$ are possible (since if $\Theta$ achieves
$z\Theta = \|z\|e_0$, then $-\Theta$ is orthogonal and achieves $z(-\Theta) = -\|z\|e_0$). One way to achieve the
transformation (34.9) is to employ a Householder reflection. We motivate it by means of a
geometric argument.


FIGURE 34.3 The vector $z$ is aligned with $e_0$ by reflecting it across the line that bisects the
angle between the sides $z$ and $\alpha e_0$. This construction provides a geometric interpretation for the
Householder transformation.


Thus, refer to Fig. 34.3, which shows the vector $z$ and its destination $\alpha e_0$. Since $|\alpha| = \|z\|$,
the triangle with sides $z$ and $\alpha e_0$ is isosceles and we denote its base by

$$g \triangleq z - \alpha e_0$$

If we drop a perpendicular from the origin of $z$ to $g$, it will divide $g$ into two equal parts,
with the upper part being the projection of $z$ onto $g$ and is equal to $zg^T\|g\|^{-2}g$.
This means that $g = 2zg^T\|g\|^{-2}g$ and, consequently,

$$\alpha e_0 = z - 2zg^T(gg^T)^{-1}g = z\underbrace{\left[I - 2\frac{g^Tg}{gg^T}\right]}_{\Theta} \qquad (34.10)$$

where

$$\Theta \triangleq I - 2\frac{g^Tg}{gg^T} \qquad (34.11)$$

We thus have a matrix $\Theta$ that maps $z$ to $\alpha e_0$. It is straightforward to verify that this $\Theta$ is
orthogonal and involutory, as desired. The matrix $\Theta$ so defined is called an (elementary)
Householder transformation or reflection; it is a reflection since its effect is to reflect $z$
across the line that bisects the angle between the sides $z$ and $\alpha e_0$ (see Fig. 34.3).
Moreover, for the Householder matrix $\Theta$ in (34.11), we see that it is a rank-one modification of
the identity matrix. Therefore, it has $(n - 1)$ eigenvalues at 1 and a single eigenvalue at
$-1$. Therefore, $\det \Theta = -1$, which again confirms that it is a reflector. In summary, we
established the following result.


Lemma 34.3 (Real Householder reflection) Consider an $n$-dimensional vector
$z$ with real-valued entries. Choose $g = z \pm \|z\|e_0$ and

$$\Theta = I - 2\frac{g^Tg}{gg^T}$$

to get $z\Theta = \mp\|z\|e_0$. Here, $e_0 = [1 \;\; 0 \;\; 0 \;\; \ldots \;\; 0]$ is the first basis vector.

Usually, the sign in the expression for $g$ is chosen to be the same as the sign of the
leading entry of $z$ in order to avoid a vector $g$ with small Euclidean norm.
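
A minimal sketch of the real Householder reflection follows. It applies $\Theta = I - 2g^Tg/(gg^T)$ to a row vector without forming the matrix, and picks the sign in $g$ to match the leading entry of $z$; the function names and test vectors are hypothetical, and $z$ is assumed to be nonzero.

```python
import math


def householder_vector(z):
    """g = z + sign(z(0)) ||z|| e_0, for a nonzero real row vector z."""
    norm = math.sqrt(sum(v * v for v in z))
    sign = 1.0 if z[0] >= 0 else -1.0
    g = list(z)
    g[0] += sign * norm
    return g


def reflect(x, g):
    """Return x Theta = x - 2 (x g^T / g g^T) g."""
    gg = sum(v * v for v in g)
    xg = sum(p * q for p, q in zip(x, g))
    return [p - 2.0 * xg / gg * q for p, q in zip(x, g)]


z = [3.0, 0.0, 4.0]
g = householder_vector(z)              # [8, 0, 4]
post = reflect(z, g)                   # -||z|| e_0 = [-5, 0, 0]
back = reflect(post, g)                # Theta is involutory: recovers z
other = reflect([1.0, 2.0, 2.0], g)    # any other row keeps its norm
print(post, back, other)
```

Applying the same reflection twice returns the original vector, and rows other than $z$ keep their Euclidean norm.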


Complex data 

More generally, consider a 1 x n vector z with possibly complex entries, and assume that 
we wish to determine a transformation © that transforms it to the form (34.9) with a scalar 
o that is possibly complex-valued, and where © is required to be a Hermitian unitary 
matrix, i.e., it should satisfy O = O* and O? = I. 

Again, the scalar o cannot be arbitrary and its value can be determined even before 
determining the expression for a matrix O that achieves (34.9). Indeed, note from (34.9), 
and from the unitarity of ©, that |a|? = |[z|? so that |a| = ||z||. Moreover, from the 
equality zOz* = az, and from the fact that 202" is real (since © is Hermitian), we 
conclude that azġ must be real. If we introduce the polar representation of the first entry 
of z, namely, 2(0) = |a} e7%, then it follows that a is given by 


a = ||z\|e"% 
The prior geometric construction of © can be repeated in the complex case and it leads to 


the following conclusion. 


Lemma 34.4 (Complex Householder reflection) Consider a 1 x n vector z 
with possibly complex-valued entries. Choose g = z + ||z||e?« eg and 


to get 20 = T.|[z|]e/?«e. Here,eg = [10 0... 0] and ġa is the phase of the 
leading entry of z. 


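Proof: Besides the geometric argument, we can establish the result of the lemma algebraically as follows. We express $g$ as $g = z + \alpha e_0$, where $\alpha$ is a scalar that satisfies $|\alpha|^2 = \|z\|^2$ and $\alpha a^*$ is real, with $a = z(0)$ denoting the leading entry of $z$. Then direct calculation shows that

$$\|g\|^2 = 2\|z\|^2 + 2\alpha a^*$$

$$z g^* g = z\|z\|^2 + \alpha\|z\|^2 e_0 + \alpha^* a\, z + \alpha(\alpha a^*) e_0 = \left(\|z\|^2 + \alpha a^*\right)(z + \alpha e_0)$$

and, hence,

$$z\Theta = z - 2\,\frac{z g^* g}{g g^*} = z - (z + \alpha e_0) = -\alpha e_0$$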
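The statement of Lemma 34.4 can be checked numerically. The following is a minimal sketch rather than a reference implementation; the function name `complex_householder` and the test vector are illustrative assumptions, and the "+" sign is chosen in the expression for $g$.

```python
import numpy as np

def complex_householder(z):
    """Hermitian unitary Theta with z @ Theta = -alpha * e0 ('+' sign in g)."""
    z = np.asarray(z, dtype=complex)
    phase = np.exp(1j * np.angle(z[0]))          # e^{j phi_a}, phase of leading entry
    alpha = np.linalg.norm(z) * phase            # alpha = ||z|| e^{j phi_a}
    g = z.copy()
    g[0] += alpha                                # g = z + alpha * e0 (row vector)
    # Theta = I - 2 g^* g / (g g^*): outer(conj(g), g) is the matrix g^* g.
    Theta = np.eye(z.size) - 2.0 * np.outer(g.conj(), g) / np.vdot(g, g).real
    return Theta, alpha

z = np.array([1 + 2j, -0.5j, 3.0])               # hypothetical test vector
Theta, alpha = complex_householder(z)
z_rotated = z @ Theta                            # expected: [-alpha, 0, 0]
```

Only the leading entry of `z_rotated` is nonzero, and its phase matches that of the leading entry of `z` up to the sign fixed by the choice of $g$.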


Example 34.2 (Using Householder transformations) 


We re-consider the pre-array of numbers A in (34.7) and now show how to transform it to the form 
(34.8) by means of Householder transformations. 


Let $z_1$ denote the top row of $A$, i.e., $z_1 = [\,1\ \ 0.75\ \ 0.25\,]$. Our first step is to annihilate the last two entries of $z_1$. From expression (34.11), we know that this transformation can be achieved by using the following $3 \times 3$ Householder transformation:

$$\Theta_1 = I_3 - 2\,\frac{g_1^T g_1}{g_1 g_1^T}$$

where $g_1 = z_1 \pm \|z_1\|\,[\,1\ \ 0\ \ 0\,]$. We initially choose the sign in the expression for $g_1$ to be the same as the sign of the leading entry of $z_1$, which is positive, so that

$$g_1 = [\,1\ \ 0.75\ \ 0.25\,] + [\,1.2748\ \ 0\ \ 0\,] = [\,2.2748\ \ 0.7500\ \ 0.2500\,]$$

Applying $\Theta_1$ to the two rows of $A$ gives

$$z_1\Theta_1 = z_1 - 2\,\frac{z_1 g_1^T}{g_1 g_1^T}\,g_1 = [\,-1.2748\ \ 0\ \ 0\,]$$

and

$$z_2\Theta_1 = z_2 - 2\,\frac{z_2 g_1^T}{g_1 g_1^T}\,g_1 = [\,-0.4707\ \ -0.0871\ \ 0.1043\,]$$

In other words,

$$\underbrace{\begin{bmatrix} 1 & 0.75 & 0.25\\ 0.4 & 0.2 & 0.2 \end{bmatrix}}_{A}\,\Theta_1 = \underbrace{\begin{bmatrix} -1.2748 & 0 & 0\\ -0.4707 & -0.0871 & 0.1043 \end{bmatrix}}_{A_1}$$


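Of course, the post-array is not unique: choosing the negative sign in the expression for $g_1$ would give a positive leading entry, $z_1\Theta_1 = [\,1.2748\ \ 0\ \ 0\,]$. For illustration, we continue with the post-array obtained by switching the signs of all entries above (i.e., by using $-\Theta_1$, which is also orthogonal), say,

$$\underbrace{\begin{bmatrix} 1 & 0.75 & 0.25\\ 0.4 & 0.2 & 0.2 \end{bmatrix}}_{A} \;\longrightarrow\; \underbrace{\begin{bmatrix} 1.2748 & 0 & 0\\ 0.4707 & 0.0871 & -0.1043 \end{bmatrix}}_{A_1}$$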


In order to annihilate the $(2,3)$ entry of $A_1$ we can replace the vector $[\,0.0871\ \ -0.1043\,]$ by one of the form $[\,\alpha\ \ 0\,]$, where $\alpha$ is the norm of the vector, i.e., by $[\,0.1359\ \ 0\,]$. In this way, it is seen that the resulting post-array in (34.8) can be taken as

$$\begin{bmatrix} 1.2748 & 0 & 0\\ 0.4707 & 0.1359 & 0 \end{bmatrix}$$
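The two annihilation steps of Example 34.2 can be reproduced numerically. The sketch below assumes real data and uses a hypothetical helper `householder_row`; the second reflection acts only on the last two columns, and its sign is chosen so that the new diagonal entry is positive.

```python
import numpy as np

def householder_row(z, sign=None):
    """Theta = I - 2 g^T g / (g g^T) with g = z + sign * ||z|| * e0 (real row z)."""
    z = np.asarray(z, dtype=float)
    e0 = np.zeros_like(z)
    e0[0] = 1.0
    if sign is None:                       # default: same sign as the leading entry
        sign = 1.0 if z[0] >= 0 else -1.0
    g = z + sign * np.linalg.norm(z) * e0
    return np.eye(z.size) - 2.0 * np.outer(g, g) / (g @ g)

A = np.array([[1.0, 0.75, 0.25],
              [0.4, 0.2, 0.2]])

Theta1 = householder_row(A[0])             # annihilates entries (1,2) and (1,3)
A1 = A @ Theta1

Theta2 = np.eye(3)                         # acts on the last two columns only
Theta2[1:, 1:] = householder_row(A1[1, 1:], sign=-1.0)
A2 = A1 @ Theta2                           # lower triangular post-array
```

Up to the signs of the entries in the first column, `A2` agrees with the post-array quoted in the example.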
QR and Inverse QR Algorithms


In preparation for our discussions of RLS array methods, we now recall briefly the defining equations of RLS for ease of reference.

Consider a collection of data $\{u_j, d(j)\}_{j=0}^{N}$, where the $\{u_j\}$ are $1 \times M$ and the $\{d(j)\}$ are scalars, in addition to an $M \times 1$ column vector $\bar{w}$, an $M \times M$ positive-definite matrix $\Pi$, and a scalar $\lambda$ satisfying $0 \ll \lambda \le 1$. Then the solution, $w_N$, of the regularized least-squares problem

$$\min_{w}\ \left[\lambda^{(N+1)}(w - \bar{w})^*\Pi(w - \bar{w}) + \sum_{j=0}^{N}\lambda^{N-j}\,|d(j) - u_j w|^2\right] \qquad (35.1)$$

and the corresponding minimum cost, $\xi(N)$, can be obtained recursively as follows (recall Alg. 30.2). Start with $w_{-1} = \bar{w}$, $P_{-1} = \Pi^{-1}$, $\xi(-1) = 0$, and repeat for $i \ge 0$:

$$\begin{aligned}
\gamma(i) &= 1/(1 + \lambda^{-1}u_i P_{i-1}u_i^*)\\
g_i &= \lambda^{-1}\gamma(i)P_{i-1}u_i^*\\
e(i) &= d(i) - u_i w_{i-1}\\
w_i &= w_{i-1} + g_i e(i)\\
P_i &= \lambda^{-1}P_{i-1} - g_i g_i^*/\gamma(i)\\
r(i) &= d(i) - u_i w_i\\
\xi(i) &= \lambda\xi(i-1) + r(i)e^*(i)
\end{aligned} \qquad (35.2)$$

Moreover, the following relations also hold at each iteration $i$:

$$\gamma(i) = 1 - u_i P_i u_i^*, \qquad g_i = P_i u_i^* \qquad (35.3)$$

We further remarked following Alg. 30.2 that the $\{w_i\}$ also satisfy the following construction. Start with $w_{-1} = \bar{w}$, $s_{-1} = 0$, $\Phi_{-1} = \Pi$, and repeat for $i \ge 0$:

$$\Phi_i = \lambda\Phi_{i-1} + u_i^* u_i, \qquad s_i = \lambda s_{i-1} + u_i^*[d(i) - u_i\bar{w}] \qquad (35.4)$$

Then, at each iteration $i$, it holds that

$$\Phi_i[w_i - \bar{w}] = s_i \qquad (35.5)$$

$$\gamma(i) = 1 - u_i\Phi_i^{-1}u_i^* \qquad (35.6)$$

because, by the definition (30.30), $\Phi_i = P_i^{-1}$. Equations (35.4)-(35.6) will be significant in this chapter in that they will form the basis for one of the most celebrated array variants of RLS (see Sec. 35.2).
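As a quick numerical check of the recap above, the following sketch runs the recursions (35.2) and (35.4) side by side on random real-valued data and records whether relations (35.3), (35.5), and (35.6) hold at every step. The problem sizes, the seed, and the choice of $\Pi$ are arbitrary assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, lam = 3, 25, 0.95                  # illustrative sizes (assumptions)
Pi = 0.5 * np.eye(M)                     # regularization matrix Pi
w_bar = rng.standard_normal(M)           # initial condition w-bar
U = rng.standard_normal((N, M))          # rows are the regressors u_i
d = rng.standard_normal(N)               # reference samples d(i)

w, P = w_bar.copy(), np.linalg.inv(Pi)   # w_{-1}, P_{-1}
Phi, s = Pi.copy(), np.zeros(M)          # Phi_{-1}, s_{-1}
checks = []
for u, d_i in zip(U, d):
    # RLS recursions (35.2), real data
    gamma = 1.0 / (1.0 + u @ P @ u / lam)
    g = gamma * (P @ u) / lam
    e = d_i - u @ w
    w = w + g * e
    P = P / lam - np.outer(g, g) / gamma
    # information-form recursions (35.4)
    Phi = lam * Phi + np.outer(u, u)
    s = lam * s + u * (d_i - u @ w_bar)
    checks.append((
        np.isclose(gamma, 1.0 - u @ P @ u),                    # (35.3)
        np.allclose(g, P @ u),                                 # (35.3)
        np.allclose(Phi @ (w - w_bar), s),                     # (35.5)
        np.isclose(gamma, 1.0 - u @ np.linalg.solve(Phi, u)),  # (35.6)
    ))
all_ok = all(all(c) for c in checks)
```

The flag `all_ok` summarizes the per-iteration checks; the product `P @ Phi` should also stay close to the identity.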
35.1 INVERSE QR ALGORITHM


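The first array algorithm we derive is known as the inverse QR algorithm (and also as the square-root RLS algorithm). It is based on the observation that the expressions for $\{\gamma^{-1}(i), g_i\}$ in (35.2) can be put into the desirable forms (33.19) and (33.20). Indeed, note that

$$\begin{cases}
\gamma^{-1}(i) = 1 + \lambda^{-1}u_i P_{i-1}u_i^*\\
g_i\gamma^{-1}(i) = \lambda^{-1}P_{i-1}u_i^*
\end{cases} \qquad (35.7)$$

Comparing with (33.19) and (33.20) we see that we can make the identifications:

$$\begin{cases}
C \leftarrow \gamma^{-1/2}(i), & A \leftarrow 1, & B \leftarrow \lambda^{-1/2}u_i P_{i-1}^{1/2}\\
F \leftarrow g_i\gamma^{-1/2}(i), & D \leftarrow 0, & E \leftarrow \lambda^{-1/2}P_{i-1}^{1/2}
\end{cases}$$

where $P_{i-1}^{1/2}$ denotes the Cholesky factor of $P_{i-1}$. In other words, the expression for $\gamma^{-1}(i)$ in (35.7) corresponds to the norm preservation identity (remember that $\gamma(i)$ is a real variable even for complex data):

$$\begin{bmatrix}\gamma^{-1/2}(i) & 0\end{bmatrix}\begin{bmatrix}\gamma^{-1/2}(i)\\ 0\end{bmatrix} = \begin{bmatrix}1 & \lambda^{-1/2}u_i P_{i-1}^{1/2}\end{bmatrix}\begin{bmatrix}1\\ \lambda^{-1/2}P_{i-1}^{*/2}u_i^*\end{bmatrix}$$

whereas the expression for $g_i\gamma^{-1}(i)$ in (35.7) corresponds to the inner-product preservation identity:

$$\begin{bmatrix}g_i\gamma^{-1/2}(i) & \times\end{bmatrix}\begin{bmatrix}\gamma^{-1/2}(i)\\ 0\end{bmatrix} = \begin{bmatrix}0 & \lambda^{-1/2}P_{i-1}^{1/2}\end{bmatrix}\begin{bmatrix}1\\ \lambda^{-1/2}P_{i-1}^{*/2}u_i^*\end{bmatrix}$$

Therefore, motivated by the discussion that led to (33.21) from (33.19)-(33.20), we let $\Theta_i$ be a unitary matrix that transforms the pre-array

$$A = \begin{bmatrix}1 & \lambda^{-1/2}u_i P_{i-1}^{1/2}\\ 0 & \lambda^{-1/2}P_{i-1}^{1/2}\end{bmatrix}$$

to lower triangular form, say,

$$B = \begin{bmatrix}x & 0\\ y & Z\end{bmatrix}$$

for some variables $\{x, y, Z\}$ to be determined, i.e.,

$$\begin{bmatrix}1 & \lambda^{-1/2}u_i P_{i-1}^{1/2}\\ 0 & \lambda^{-1/2}P_{i-1}^{1/2}\end{bmatrix}\Theta_i = \begin{bmatrix}x & 0\\ y & Z\end{bmatrix} \qquad (35.8)$$

where $x$ is a scalar, $y$ is a vector, and $Z$ is a lower-triangular matrix with positive-diagonal entries. Observe that the pre- and post-arrays in (35.8) have the forms (assuming $M = 3$):

$$\begin{bmatrix}\times & \times & \times & \times\\ 0 & \times & 0 & 0\\ 0 & \times & \times & 0\\ 0 & \times & \times & \times\end{bmatrix} \quad\text{and}\quad \begin{bmatrix}\times & 0 & 0 & 0\\ \times & \times & 0 & 0\\ \times & \times & \times & 0\\ \times & \times & \times & \times\end{bmatrix}$$

respectively. We already know from the explanation in Sec. 33.4 that the values of $x$ and $y$ in (35.8) should be

$$x = \gamma^{-1/2}(i), \qquad y = g_i\gamma^{-1/2}(i)$$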
Alternatively, we can identify the values of $\{x, y\}$, and also the value of $Z$, by "squaring" both sides of (35.8) to obtain the equality:

$$\begin{bmatrix}1 & \lambda^{-1/2}u_i P_{i-1}^{1/2}\\ 0 & \lambda^{-1/2}P_{i-1}^{1/2}\end{bmatrix}\begin{bmatrix}1 & \lambda^{-1/2}u_i P_{i-1}^{1/2}\\ 0 & \lambda^{-1/2}P_{i-1}^{1/2}\end{bmatrix}^* = \begin{bmatrix}x & 0\\ y & Z\end{bmatrix}\begin{bmatrix}x & 0\\ y & Z\end{bmatrix}^*$$

Comparing terms on both sides we find that the following equalities must hold:

$$\begin{aligned}
|x|^2 &= 1 + \lambda^{-1}u_i P_{i-1}u_i^*\\
y x^* &= \lambda^{-1}P_{i-1}u_i^*\\
yy^* + ZZ^* &= \lambda^{-1}P_{i-1}
\end{aligned}$$

From the first equation we get $|x|^2 = \gamma^{-1}(i)$ so that $x = \gamma^{-1/2}(i)$. From the second equation we get $y = g_i\gamma^{-1/2}(i)$, and from the third equation we get

$$ZZ^* = \lambda^{-1}P_{i-1} - yy^* = \lambda^{-1}P_{i-1} - g_i g_i^*/\gamma(i) = P_i$$

where the last equality follows from recursion (35.2). Therefore, since $Z$ is lower triangular with positive diagonal entries, we conclude that $Z$ is the Cholesky factor of $P_i$. We thus find that, in addition to the quantities $\{\gamma^{-1/2}(i), g_i\gamma^{-1/2}(i)\}$, the array algorithm (35.8) also provides the square-root factor $P_i^{1/2}$, which is needed to form the pre-array for the next time instant. In this way, we obtain a self-contained array method where the quantities that are needed to form the pre-array are obtained in the post-array and the procedure can be repeated. In summary, we arrive at the following algorithm.


Algorithm 35.1 (Inverse QR) Consider data $\{u_j, d(j)\}_{j=0}^{N}$, where the $u_j$ are $1 \times M$ and the $d(j)$ are scalars. Consider also an $M \times 1$ vector $\bar{w}$, an $M \times M$ positive-definite matrix $\Pi$, and a scalar $0 \ll \lambda \le 1$. The solution, $w_N$, of the least-squares problem (35.1) can be computed recursively as follows.

Let $\Sigma = \Pi^{-1}$ and introduce the Cholesky decomposition $\Sigma = \Sigma^{1/2}\Sigma^{*/2}$, where $\Sigma^{1/2}$ is lower triangular with positive-diagonal entries. Then start with $w_{-1} = \bar{w}$, $P_{-1}^{1/2} = \Sigma^{1/2}$, and repeat for $i \ge 0$.

1. Find a unitary matrix $\Theta_i$ that lower triangularizes the pre-array shown below and generates a post-array with positive diagonal entries. Then the entries in the post-array will correspond to

$$\begin{bmatrix}1 & \lambda^{-1/2}u_i P_{i-1}^{1/2}\\ 0 & \lambda^{-1/2}P_{i-1}^{1/2}\end{bmatrix}\Theta_i = \begin{bmatrix}\gamma^{-1/2}(i) & 0\\ g_i\gamma^{-1/2}(i) & P_i^{1/2}\end{bmatrix}$$

2. Update the weight vector as

$$w_i = w_{i-1} + \left[g_i\gamma^{-1/2}(i)\right]\left[\gamma^{-1/2}(i)\right]^{-1}[d(i) - u_i w_{i-1}]$$

where the quantities $\{g_i\gamma^{-1/2}(i), \gamma^{-1/2}(i)\}$ are read from the post-array.

The computational complexity of this algorithm is $O(M^2)$ operations per iteration.
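The steps of Alg. 35.1 can be sketched as follows for real-valued data. This is an illustration under stated assumptions, not a tuned implementation: the helper `lower_triangularize` obtains the post-array from a NumPy QR factorization of the transposed pre-array (one of several valid ways to realize $\Theta_i$, and not an $O(M^2)$ one), and the function names and test data are hypothetical.

```python
import numpy as np

def lower_triangularize(A):
    """Return A @ Theta for an orthogonal Theta such that the result is lower
    triangular with nonnegative diagonal (real, square A)."""
    Q, R = np.linalg.qr(A.T)                       # A.T = Q R  =>  A Q = R.T
    signs = np.where(np.diag(R) < 0, -1.0, 1.0)
    return (R * signs[:, None]).T                  # equals A @ (Q * signs)

def inverse_qr_rls(U, d, Pi, w_bar, lam):
    M = U.shape[1]
    w = np.array(w_bar, dtype=float)               # w_{-1}
    P_half = np.linalg.cholesky(np.linalg.inv(Pi)) # P_{-1}^{1/2}
    for u, d_i in zip(U, d):
        pre = np.zeros((M + 1, M + 1))
        pre[0, 0] = 1.0
        pre[0, 1:] = (u @ P_half) / np.sqrt(lam)
        pre[1:, 1:] = P_half / np.sqrt(lam)
        post = lower_triangularize(pre)
        gamma_inv_half = post[0, 0]                # gamma^{-1/2}(i)
        g_scaled = post[1:, 0]                     # g_i gamma^{-1/2}(i)
        P_half = post[1:, 1:]                      # P_i^{1/2}
        w = w + (g_scaled / gamma_inv_half) * (d_i - u @ w)
    return w, P_half

rng = np.random.default_rng(1)
M, N, lam = 4, 50, 0.98
U, d = rng.standard_normal((N, M)), rng.standard_normal(N)
Pi, w_bar = 0.1 * np.eye(M), rng.standard_normal(M)
w, P_half = inverse_qr_rls(U, d, Pi, w_bar, lam)

# Reference: solve the normal equations of (35.1) directly.
wts = lam ** np.arange(N - 1, -1, -1)
Phi = lam ** N * Pi + (U.T * wts) @ U
w_ref = w_bar + np.linalg.solve(Phi, (U.T * wts) @ (d - U @ w_bar))
```

In a practical implementation the triangularization would be carried out with a sequence of Givens rotations or Householder reflections, which is what gives the $O(M^2)$ cost per iteration.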


Remark 35.1 (Terminology) The above array algorithm is known as the inverse QR method in the adaptive filtering literature for the following reason. The qualification inverse refers to the fact that $P_i^{1/2}$ is a square-root factor of the inverse of $\Phi_i$ in (35.4). Just note that since, by definition, $\Phi_i = P_i^{-1}$ and $P_i = P_i^{1/2}P_i^{*/2}$, we have $\Phi_i^{-1} = P_i^{1/2}P_i^{*/2}$. The qualification QR, on the other hand, is because this array method relates to the QR decomposition of a matrix, as explained in the third remark below. The algorithm is also known as square-root RLS since it propagates the square-root factor $P_i^{1/2}$; this latter terminology is borrowed from Kalman filtering (see Prob. VIII.12).


Remark 35.2 (Reliability) By propagating a square-root factor of $P_i$, rather than $P_i$ itself as in a plain RLS implementation, the danger of having $P_i$ lose its positive-definiteness due to numerical inaccuracies is essentially eliminated. If needed, $P_i$ can be recovered by squaring, i.e., via $P_i = P_i^{1/2}P_i^{*/2}$. However, this step is not necessary since the array method already evaluates the gain vector $g_i$ and, moreover, the entries of its post-array contain everything we need in order to proceed to the next iteration.


Remark 35.3 (Relation to the QR decomposition) If we conjugate-transpose the array equations of Alg. 35.1 we get

$$\Theta_i^*\begin{bmatrix}1 & 0\\ \lambda^{-1/2}P_{i-1}^{*/2}u_i^* & \lambda^{-1/2}P_{i-1}^{*/2}\end{bmatrix} = \begin{bmatrix}\gamma^{-1/2}(i) & \gamma^{-1/2}(i)g_i^*\\ 0 & P_i^{*/2}\end{bmatrix}$$

where the rightmost array is upper triangular with positive diagonal entries. Since $\Theta_i$ is unitary, this way of expressing the array equations amounts to a QR decomposition (cf. Sec. B.5) of the matrix

$$\begin{bmatrix}1 & 0\\ \lambda^{-1/2}P_{i-1}^{*/2}u_i^* & \lambda^{-1/2}P_{i-1}^{*/2}\end{bmatrix} \qquad (35.9)$$

where the matrix $\Theta_i$ plays the role of the Q factor (recall the defining expression (B.9)). This observation shows that an alternative way of implementing the array algorithm is by performing the QR decomposition of the matrix (35.9).


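Remark 35.4 (Further motivation) The inverse QR algorithm could have been alternatively derived as follows. Starting from the equations for $\{g_i, \gamma(i), P_i\}$ in (35.2), we note that they can be combined together in factored form as follows:

$$\begin{bmatrix}1 & \lambda^{-1/2}u_i P_{i-1}^{1/2}\\ 0 & \lambda^{-1/2}P_{i-1}^{1/2}\end{bmatrix}\begin{bmatrix}1 & \lambda^{-1/2}u_i P_{i-1}^{1/2}\\ 0 & \lambda^{-1/2}P_{i-1}^{1/2}\end{bmatrix}^* = \begin{bmatrix}\gamma^{-1/2}(i) & 0\\ g_i\gamma^{-1/2}(i) & P_i^{1/2}\end{bmatrix}\begin{bmatrix}\gamma^{-1/2}(i) & 0\\ g_i\gamma^{-1/2}(i) & P_i^{1/2}\end{bmatrix}^*$$

To verify that this is indeed the case, simply expand both sides and compare terms. Now this equality fits precisely into the statement of Lemma 33.1 by choosing

$$A = \begin{bmatrix}1 & \lambda^{-1/2}u_i P_{i-1}^{1/2}\\ 0 & \lambda^{-1/2}P_{i-1}^{1/2}\end{bmatrix} \quad\text{and}\quad B = \begin{bmatrix}\gamma^{-1/2}(i) & 0\\ g_i\gamma^{-1/2}(i) & P_i^{1/2}\end{bmatrix}$$

so that there should exist a unitary matrix $\Theta_i$ that takes $A$ to $B$.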
35.2 QR ALGORITHM


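The second array algorithm we consider is known as the QR algorithm (and also as the square-root information RLS algorithm). It follows from the observation that the recursions for $\{\Phi_i, s_i\}$ in (35.4) can be put into the desirable forms (33.19) and (33.20).

Indeed, let $\Phi_i^{1/2}$ denote the Cholesky factor of $\Phi_i$, and introduce the auxiliary signals

$$\bar{d}(i) \triangleq d(i) - u_i\bar{w}, \qquad \bar{w}_i \triangleq w_i - \bar{w} \qquad (35.10)$$

as well as the auxiliary column vector

$$q_i \triangleq \Phi_i^{*/2}[w_i - \bar{w}] \qquad (35.11)$$

These auxiliary signals would reduce to $d(i)$, $w_i$, and $\Phi_i^{*/2}w_i$, respectively, when $\bar{w} = 0$, which is usually the case. We consider a nonzero $\bar{w}$ for generality.

From the equation for $w_i$ in (35.5), it is seen that $q_i$ satisfies

$$\Phi_i^{1/2}q_i = s_i \qquad (35.12)$$

With $\{\bar{d}(i), \bar{w}_i, q_i\}$ so defined, we can rewrite the recursions in (35.4) as

$$\begin{aligned}
\Phi_i^{1/2}\Phi_i^{*/2} &= \lambda\Phi_{i-1}^{1/2}\Phi_{i-1}^{*/2} + u_i^* u_i\\
\Phi_i^{1/2}q_i &= \lambda\Phi_{i-1}^{1/2}q_{i-1} + u_i^*\bar{d}(i)
\end{aligned} \qquad (35.13)$$

If we conjugate-transpose the second equation, we get

$$\begin{aligned}
\Phi_i^{1/2}\Phi_i^{*/2} &= \lambda\Phi_{i-1}^{1/2}\Phi_{i-1}^{*/2} + u_i^* u_i\\
q_i^*\Phi_i^{*/2} &= \lambda q_{i-1}^*\Phi_{i-1}^{*/2} + \bar{d}^*(i)u_i
\end{aligned} \qquad (35.14)$$

Comparing with (33.19) and (33.20) we see that we can make the identifications:

$$\begin{cases}
C \leftarrow \Phi_i^{1/2}, & A \leftarrow \lambda^{1/2}\Phi_{i-1}^{1/2}, & B \leftarrow u_i^*\\
F \leftarrow q_i^*, & D \leftarrow \lambda^{1/2}q_{i-1}^*, & E \leftarrow \bar{d}^*(i)
\end{cases}$$

In other words, the expression for $\Phi_i$ in (35.14) corresponds to the norm preservation identity:

$$\begin{bmatrix}\Phi_i^{1/2} & 0\end{bmatrix}\begin{bmatrix}\Phi_i^{*/2}\\ 0\end{bmatrix} = \begin{bmatrix}\lambda^{1/2}\Phi_{i-1}^{1/2} & u_i^*\end{bmatrix}\begin{bmatrix}\lambda^{1/2}\Phi_{i-1}^{*/2}\\ u_i\end{bmatrix} \qquad (35.15)$$

whereas the expression for $s_i^*$ in (35.14) corresponds to the inner-product preservation identity:

$$\begin{bmatrix}q_i^* & \times\end{bmatrix}\begin{bmatrix}\Phi_i^{*/2}\\ 0\end{bmatrix} = \begin{bmatrix}\lambda^{1/2}q_{i-1}^* & \bar{d}^*(i)\end{bmatrix}\begin{bmatrix}\lambda^{1/2}\Phi_{i-1}^{*/2}\\ u_i\end{bmatrix} \qquad (35.16)$$

Therefore, motivated again by the discussion that led to (33.21) from (33.19)-(33.20), we let $\Theta_i$ be a unitary matrix that transforms the pre-array

$$A = \begin{bmatrix}\lambda^{1/2}\Phi_{i-1}^{1/2} & u_i^*\\ \lambda^{1/2}q_{i-1}^* & \bar{d}^*(i)\end{bmatrix}$$

to lower triangular form, say,

$$B = \begin{bmatrix}X & 0\\ y & z\end{bmatrix}$$

for some variables $\{X, y, z\}$ to be determined, with $X$ lower triangular with positive diagonal entries, $y$ a row vector, and $z$ a scalar, i.e.,

$$\begin{bmatrix}\lambda^{1/2}\Phi_{i-1}^{1/2} & u_i^*\\ \lambda^{1/2}q_{i-1}^* & \bar{d}^*(i)\end{bmatrix}\Theta_i = \begin{bmatrix}X & 0\\ y & z\end{bmatrix} \qquad (35.17)$$

Observe that the pre- and post-arrays in (35.17) have the forms (assuming $M = 3$):

$$\begin{bmatrix}\times & 0 & 0 & \times\\ \times & \times & 0 & \times\\ \times & \times & \times & \times\\ \times & \times & \times & \times\end{bmatrix} \quad\text{and}\quad \begin{bmatrix}\times & 0 & 0 & 0\\ \times & \times & 0 & 0\\ \times & \times & \times & 0\\ \times & \times & \times & \times\end{bmatrix}$$

respectively. Therefore, the purpose of $\Theta_i$ is to introduce the zeros in the last column and to generate a lower triangular post-array with positive diagonal entries. This can be achieved, for example, by using Givens rotations and pivoting with the diagonal entries of $\lambda^{1/2}\Phi_{i-1}^{1/2}$, as explained in App. 34.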

Continuing with the array description (35.17), we already know from the explanation in Sec. 33.4 that the values of $X$ and $y$ in (35.17) should be $X = \Phi_i^{1/2}$ and $y = q_i^*$. Alternatively, we can identify the values of $\{X, y\}$, and also $z$, by "squaring" both sides of (35.17) to obtain the equality:

$$\begin{bmatrix}\lambda^{1/2}\Phi_{i-1}^{1/2} & u_i^*\\ \lambda^{1/2}q_{i-1}^* & \bar{d}^*(i)\end{bmatrix}\begin{bmatrix}\lambda^{1/2}\Phi_{i-1}^{1/2} & u_i^*\\ \lambda^{1/2}q_{i-1}^* & \bar{d}^*(i)\end{bmatrix}^* = \begin{bmatrix}X & 0\\ y & z\end{bmatrix}\begin{bmatrix}X & 0\\ y & z\end{bmatrix}^*$$

Then comparing terms on both sides we find that the following equalities must hold:

$$\begin{aligned}
XX^* &= \lambda\Phi_{i-1} + u_i^* u_i\\
yX^* &= \lambda s_{i-1}^* + \bar{d}^*(i)u_i\\
yy^* + zz^* &= \lambda\|q_{i-1}\|^2 + |\bar{d}(i)|^2
\end{aligned}$$

From the first equation we get $XX^* = \Phi_i$. But since $X$ is lower triangular with positive diagonal entries, we conclude that $X$ is the Cholesky factor of $\Phi_i$, namely,

$$X = \Phi_i^{1/2}$$

From the second equation we get $y\Phi_i^{*/2} = s_i^*$ so that, from (35.12),

$$y = q_i^*$$

Thus, so far we have established that the array algorithm (35.17) leads to a recursion of the form

$$\begin{bmatrix}\lambda^{1/2}\Phi_{i-1}^{1/2} & u_i^*\\ \lambda^{1/2}q_{i-1}^* & \bar{d}^*(i)\end{bmatrix}\Theta_i = \begin{bmatrix}\Phi_i^{1/2} & 0\\ q_i^* & z\end{bmatrix} \qquad (35.18)$$

Note in particular that the quantities $\{q_i, \Phi_i^{1/2}\}$ that are needed to form the next pre-array are available in the post-array and, hence, the procedure can be continued. However, it is useful to insist on a complete characterization of the variable $z$. The identification of $z$ requires more effort.
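From the third equation we find that the scalar $z$ satisfies

$$zz^* = \lambda\|q_{i-1}\|^2 + |\bar{d}(i)|^2 - \|y\|^2 = \lambda\|q_{i-1}\|^2 + |\bar{d}(i)|^2 - \|q_i\|^2$$

Substituting $q_i$ and $q_{i-1}$ by their definitions (35.11), namely,

$$q_i = \Phi_i^{*/2}\bar{w}_i, \qquad q_{i-1} = \Phi_{i-1}^{*/2}\bar{w}_{i-1}$$

we obtain

$$zz^* = \lambda\bar{w}_{i-1}^*\Phi_{i-1}\bar{w}_{i-1} - \bar{w}_i^*\Phi_i\bar{w}_i + |\bar{d}(i)|^2$$

or, equivalently,

$$zz^* = \lambda\bar{w}_{i-1}^* s_{i-1} - s_i^*\bar{w}_i + |\bar{d}(i)|^2$$

since, from (35.5), $\Phi_i\bar{w}_i = s_i$ and $\Phi_{i-1}\bar{w}_{i-1} = s_{i-1}$. Using the time-update relation for $s_i$ from (35.4) we find that

$$\begin{aligned}
zz^* &= \lambda(\bar{w}_{i-1}^* s_{i-1} - s_{i-1}^*\bar{w}_i) + \bar{d}^*(i)[\bar{d}(i) - u_i\bar{w}_i]\\
&= \lambda(\bar{w}_{i-1}^* s_{i-1} - s_{i-1}^*\bar{w}_i) + \bar{d}^*(i)[d(i) - u_i w_i]\\
&= \lambda(\bar{w}_{i-1}^* s_{i-1} - s_{i-1}^*\bar{w}_i) + \bar{d}^*(i)r(i)
\end{aligned} \qquad (35.19)$$

in terms of the a posteriori error,

$$r(i) = d(i) - u_i w_i$$

However, from

$$\Phi_{i-1}\bar{w}_{i-1} = s_{i-1}$$

we have

$$\lambda s_{i-1}^*\bar{w}_i = \lambda\bar{w}_{i-1}^*\Phi_{i-1}\bar{w}_i$$

so that

$$\begin{aligned}
\lambda(\bar{w}_{i-1}^* s_{i-1} - s_{i-1}^*\bar{w}_i) &= \bar{w}_{i-1}^*[\lambda s_{i-1} - \lambda\Phi_{i-1}\bar{w}_i]\\
&= \bar{w}_{i-1}^*[\lambda s_{i-1} - (\Phi_i - u_i^* u_i)\bar{w}_i]\\
&= \bar{w}_{i-1}^*[\lambda s_{i-1} - s_i + u_i^* u_i\bar{w}_i]\\
&= \bar{w}_{i-1}^*[-u_i^*\bar{d}(i) + u_i^* u_i\bar{w}_i]\\
&= -\bar{w}_{i-1}^* u_i^*[\bar{d}(i) - u_i\bar{w}_i]\\
&= -\bar{w}_{i-1}^* u_i^* r(i)
\end{aligned}$$

Substituting into (35.19) we find that

$$zz^* = [\bar{d}(i) - u_i\bar{w}_{i-1}]^* r(i) = [d(i) - u_i w_{i-1}]^* r(i)$$

or, equivalently,

$$zz^* = |z|^2 = e^*(i)r(i) = |e(i)|^2\gamma(i) \qquad (35.20)$$

in terms of the a priori error

$$e(i) = d(i) - u_i w_{i-1}$$

and the conversion factor $\gamma(i)$. Therefore, from the arguments presented to this point, we can only identify the magnitude of $z$, namely,

$$|z| = |e(i)|\,\gamma^{1/2}(i) \qquad (35.21)$$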


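More information is needed in order to identify the phase of $z$. To do so, we first note that the arrays (35.18) can be expanded in order to allow for the evaluation of the conversion factor as well. Specifically, by incorporating the row vector $[\,0\ \ 1\,]$ into the pre-array in (35.18), we obtain an array description of the form:

$$\begin{bmatrix}\lambda^{1/2}\Phi_{i-1}^{1/2} & u_i^*\\ \lambda^{1/2}q_{i-1}^* & \bar{d}^*(i)\\ 0 & 1\end{bmatrix}\Theta_i = \begin{bmatrix}\Phi_i^{1/2} & 0\\ q_i^* & z\\ t & s\end{bmatrix} \qquad (35.22)$$

for some row $t$ and scalar $s$ to be identified. Clearly, the value of $s$ agrees with the rightmost diagonal entry of $\Theta_i$, and this entry can be enforced to be positive (see Prob. VIII.6). Equating the inner product of the top and last rows on both sides of (35.22) we get

$$\begin{bmatrix}0 & 1\end{bmatrix}\begin{bmatrix}\lambda^{1/2}\Phi_{i-1}^{*/2}\\ u_i\end{bmatrix} = \begin{bmatrix}t & s\end{bmatrix}\begin{bmatrix}\Phi_i^{*/2}\\ 0\end{bmatrix}$$

or, equivalently, $u_i = t\,\Phi_i^{*/2}$, so that

$$t = u_i\Phi_i^{-*/2}$$

Likewise, equating the norms of the last rows on both sides of (35.22) we get

$$\begin{bmatrix}0 & 1\end{bmatrix}\begin{bmatrix}0\\ 1\end{bmatrix} = \begin{bmatrix}t & s\end{bmatrix}\begin{bmatrix}t^*\\ s^*\end{bmatrix}$$

so that, from the just derived value for $t$,

$$1 = u_i\Phi_i^{-1}u_i^* + |s|^2$$

Using expression (35.6) for $\gamma(i)$, and the fact that $s$ is positive, we can identify $s$ as

$$s = \gamma^{1/2}(i)$$

Finally, equating the inner product of the last two rows in (35.22) we get

$$\begin{bmatrix}0 & 1\end{bmatrix}\begin{bmatrix}\lambda^{1/2}q_{i-1}\\ \bar{d}(i)\end{bmatrix} = \begin{bmatrix}t & s\end{bmatrix}\begin{bmatrix}q_i\\ z^*\end{bmatrix}$$

so that, using the definition (35.11) of $q_i$,

$$\bar{d}(i) = (u_i\Phi_i^{-*/2})q_i + s z^* = u_i[w_i - \bar{w}] + \gamma^{1/2}(i)z^*$$

or, equivalently, by using $\bar{d}(i) = d(i) - u_i\bar{w}$,

$$d(i) - u_i w_i = \gamma^{1/2}(i)z^* = r(i) \qquad (35.23)$$

which allows us to identify $z$ as

$$z = r^*(i)\gamma^{-1/2}(i) = e^*(i)\gamma^{1/2}(i)$$

where the second equality uses $r(i) = \gamma(i)e(i)$, which follows from (35.20). This expression is of course consistent with (35.21).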


Algorithm 35.2 (QR algorithm) Consider data $\{u_j, d(j)\}_{j=0}^{N}$, where the $u_j$ are $1 \times M$ and the $d(j)$ are scalars. Consider also an $M \times 1$ vector $\bar{w}$, an $M \times M$ positive-definite matrix $\Pi$, and a scalar $0 \ll \lambda \le 1$. Let $\bar{d}(i) = d(i) - u_i\bar{w}$. The solution, $w_N$, of the least-squares problem (35.1) can be computed recursively as follows.

Start with $\Phi_{-1}^{1/2} = \Pi^{1/2}$, $q_{-1} = 0$, and repeat for $i \ge 0$.

1. Find a unitary matrix $\Theta_i$ that lower triangularizes the pre-array shown below and generates a post-array with positive diagonal entries in $\Phi_i^{1/2}$ as well as a positive rightmost corner entry in the last row of the post-array. The entries in the post-array will then correspond to

$$\begin{bmatrix}\lambda^{1/2}\Phi_{i-1}^{1/2} & u_i^*\\ \lambda^{1/2}q_{i-1}^* & \bar{d}^*(i)\\ 0 & 1\end{bmatrix}\Theta_i = \begin{bmatrix}\Phi_i^{1/2} & 0\\ q_i^* & e^*(i)\gamma^{1/2}(i)\\ u_i\Phi_i^{-*/2} & \gamma^{1/2}(i)\end{bmatrix}$$

2. Obtain $w_i$ by solving the triangular system of equations $\Phi_i^{*/2}[w_i - \bar{w}] = q_i$, where the quantities $\{\Phi_i^{1/2}, q_i\}$ are read from the post-array.

The computational complexity of this algorithm is $O(M^2)$ operations per iteration.

This algorithm is also sometimes referred to as the square-root information RLS algorithm, a terminology that is borrowed from Kalman filtering (see Prob. VIII.13).
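A minimal sketch of Alg. 35.2 for real-valued data follows. It is an illustration under assumptions: $\Theta_i$ is obtained from a complete NumPy QR factorization of the top block row (instead of Givens rotations), the sign of its last column is flipped when needed so that the corner entry $\gamma^{1/2}(i)$ is positive, and the function name and test data are hypothetical.

```python
import numpy as np

def qr_rls(U, d, Pi, w_bar, lam):
    """QR (square-root information) RLS, real data; returns w_N and gamma^{1/2}(N)."""
    M = U.shape[1]
    Phi_half = np.linalg.cholesky(Pi)              # Phi_{-1}^{1/2}
    q = np.zeros(M)                                # q_{-1}
    for u, d_i in zip(U, d):
        pre = np.zeros((M + 2, M + 1))
        pre[:M, :M] = np.sqrt(lam) * Phi_half
        pre[:M, M] = u
        pre[M, :M] = np.sqrt(lam) * q
        pre[M, M] = d_i - u @ w_bar                # d-bar(i)
        pre[M + 1, M] = 1.0
        # Orthogonal Theta from a complete QR of the transposed top block row.
        Q, R = np.linalg.qr(pre[:M].T, mode="complete")
        signs = np.ones(M + 1)
        signs[:M] = np.where(np.diag(R) < 0, -1.0, 1.0)   # positive diagonal
        signs[M] = -1.0 if Q[M, M] < 0 else 1.0           # positive corner entry
        post = pre @ (Q * signs)
        Phi_half = np.tril(post[:M, :M])           # Phi_i^{1/2}
        q = post[M, :M]                            # q_i (stored as a 1-D array)
        gamma_half = post[M + 1, M]                # gamma^{1/2}(i)
    w = w_bar + np.linalg.solve(Phi_half.T, q)     # Phi_i^{T/2} (w_i - w_bar) = q_i
    return w, gamma_half

rng = np.random.default_rng(2)
M, N, lam = 4, 40, 0.97
U, d = rng.standard_normal((N, M)), rng.standard_normal(N)
Pi, w_bar = 0.2 * np.eye(M), rng.standard_normal(M)
w, gamma_half = qr_rls(U, d, Pi, w_bar, lam)

# Reference quantities from the normal equations of (35.1) and from (35.6).
wts = lam ** np.arange(N - 1, -1, -1)
Phi = lam ** N * Pi + (U.T * wts) @ U
w_ref = w_bar + np.linalg.solve(Phi, (U.T * wts) @ (d - U @ w_bar))
gamma_ref = 1.0 - U[-1] @ np.linalg.solve(Phi, U[-1])
```

The weight vector is recovered only once at the end here; it can equally be computed at every iteration by the same triangular solve.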


35.3 EXTENDED QR ALGORITHM 


The QR algorithm of the previous section determines the weight vector $w_i$ by solving the triangular linear system of equations $\Phi_i^{*/2}[w_i - \bar{w}] = q_i$. An alternative procedure that avoids this step can be obtained by expanding the array equations even further. Starting from the RLS update equation (cf. (35.2)):

$$w_i = w_{i-1} + g_i e(i) \qquad (35.24)$$

and writing it as

$$\Phi_i^{-*/2}q_i = \Phi_{i-1}^{-*/2}q_{i-1} + g_i e(i) \qquad (35.25)$$

we find that (35.25) justifies the inner product equality

$$\begin{bmatrix}q_i^* & e^*(i)\gamma^{1/2}(i)\end{bmatrix}\begin{bmatrix}\Phi_i^{-1/2}\\ -g_i^*\gamma^{-1/2}(i)\end{bmatrix} = \begin{bmatrix}\lambda^{1/2}q_{i-1}^* & \bar{d}^*(i)\end{bmatrix}\begin{bmatrix}\lambda^{-1/2}\Phi_{i-1}^{-1/2}\\ 0\end{bmatrix}$$

This equality suggests that we can extend the array equations of Alg. 35.2 as follows:

$$\begin{bmatrix}\lambda^{1/2}\Phi_{i-1}^{1/2} & u_i^*\\ \lambda^{1/2}q_{i-1}^* & \bar{d}^*(i)\\ 0 & 1\\ \lambda^{-1/2}\Phi_{i-1}^{-*/2} & 0\end{bmatrix}\Theta_i = \begin{bmatrix}\Phi_i^{1/2} & 0\\ q_i^* & e^*(i)\gamma^{1/2}(i)\\ u_i\Phi_i^{-*/2} & \gamma^{1/2}(i)\\ R & m\end{bmatrix} \qquad (35.26)$$

for some matrix $R$ and column vector $m$ to be determined. As usual, by equating the inner products of the first and last rows on both sides of (35.26) we get $I = R\,\Phi_i^{*/2}$ so that $R = \Phi_i^{-*/2}$. Likewise, equating the inner product of the third and fourth rows on both sides of (35.26) we see that we must have $0 = \Phi_i^{-1}u_i^* + m\gamma^{1/2}(i)$. But since $\Phi_i^{-1} = P_i$ and $P_i u_i^* = g_i$, we get

$$m = -g_i\gamma^{-1/2}(i) \qquad (35.27)$$

In summary, we are led to the following algorithm.

Algorithm 35.3 (Extended QR) Consider the same setting of Alg. 35.2. The solution $w_N$ can be recursively computed as follows. Start with $\Phi_{-1}^{1/2} = \Pi^{1/2}$, $\Phi_{-1}^{-*/2} = \Pi^{-*/2}$, $q_{-1} = 0$, and repeat for $i \ge 0$.

1. Find a unitary matrix $\Theta_i$ that lower triangularizes the pre-array shown below and generates a post-array with positive diagonal entries in $\Phi_i^{1/2}$ as well as a positive rightmost entry in the third (block) row of the post-array (the entry corresponding to $\gamma^{1/2}(i)$). The entries in the post-array will then correspond to

$$\begin{bmatrix}\lambda^{1/2}\Phi_{i-1}^{1/2} & u_i^*\\ \lambda^{1/2}q_{i-1}^* & \bar{d}^*(i)\\ 0 & 1\\ \lambda^{-1/2}\Phi_{i-1}^{-*/2} & 0\end{bmatrix}\Theta_i = \begin{bmatrix}\Phi_i^{1/2} & 0\\ q_i^* & e^*(i)\gamma^{1/2}(i)\\ u_i\Phi_i^{-*/2} & \gamma^{1/2}(i)\\ \Phi_i^{-*/2} & -g_i\gamma^{-1/2}(i)\end{bmatrix}$$

2. Obtain $w_i$ recursively via

$$w_i = w_{i-1} - \left(-g_i\gamma^{-1/2}(i)\right)\left(e^*(i)\gamma^{1/2}(i)\right)^*$$

The computational complexity of this algorithm is still $O(M^2)$ operations per iteration.

Observe that the last row in the above array equations resembles the last row of the inverse QR method of Alg. 35.1. However, since the pre- and post-arrays in the extended QR array algorithm propagate both $\Phi_i^{1/2}$ and its inverse, this method can face numerical difficulties in finite-precision implementations.
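The extended array can be sketched in the same way as the QR array, again for real-valued data and with $\Theta_i$ obtained from a NumPy QR factorization (an assumption made for brevity; the function name and test data are hypothetical). The weight update uses only the last column of the post-array, so no triangular system is solved.

```python
import numpy as np

def extended_qr_rls(U, d, Pi, w_bar, lam):
    """Extended QR RLS for real data; returns w_N."""
    M = U.shape[1]
    Phi_half = np.linalg.cholesky(Pi)               # Phi_{-1}^{1/2}
    Phi_inv_half_T = np.linalg.inv(Phi_half).T      # Phi_{-1}^{-T/2}
    q = np.zeros(M)                                 # q_{-1}
    w = np.array(w_bar, dtype=float)                # w_{-1}
    for u, d_i in zip(U, d):
        pre = np.zeros((2 * M + 2, M + 1))
        pre[:M, :M] = np.sqrt(lam) * Phi_half
        pre[:M, M] = u
        pre[M, :M] = np.sqrt(lam) * q
        pre[M, M] = d_i - u @ w_bar                 # d-bar(i)
        pre[M + 1, M] = 1.0
        pre[M + 2:, :M] = Phi_inv_half_T / np.sqrt(lam)
        Q, R = np.linalg.qr(pre[:M].T, mode="complete")
        signs = np.ones(M + 1)
        signs[:M] = np.where(np.diag(R) < 0, -1.0, 1.0)   # positive diagonal
        signs[M] = -1.0 if Q[M, M] < 0 else 1.0           # gamma^{1/2}(i) > 0
        post = pre @ (Q * signs)
        Phi_half = np.tril(post[:M, :M])            # Phi_i^{1/2}
        q = post[M, :M]                             # q_i
        Phi_inv_half_T = post[M + 2:, :M]           # Phi_i^{-T/2}
        m, z = post[M + 2:, M], post[M, M]          # -g_i gamma^{-1/2}, e(i) gamma^{1/2}
        w = w - m * z                               # w_i = w_{i-1} + g_i e(i)
    return w

rng = np.random.default_rng(4)
M, N, lam = 3, 30, 0.99
U, d = rng.standard_normal((N, M)), rng.standard_normal(N)
Pi, w_bar = 0.5 * np.eye(M), rng.standard_normal(M)
w = extended_qr_rls(U, d, Pi, w_bar, lam)

# Reference solution of (35.1) from the normal equations.
wts = lam ** np.arange(N - 1, -1, -1)
Phi = lam ** N * Pi + (U.T * wts) @ U
w_ref = w_bar + np.linalg.solve(Phi, (U.T * wts) @ (d - U @ w_bar))
```

On short, well-conditioned runs such as this one the result matches the direct solution; the numerical concern raised above arises in finite-precision implementations over long runs.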


35.A APPENDIX: ARRAY KALMAN FILTERS 


The idea of using array algorithms to propagate square-root factors, rather than the matrices themselves, has originated in the context of Kalman filtering. The arguments used in the body of the chapter to derive array methods for recursive least-squares are similar to what is needed to derive array methods for Kalman filtering. For this reason, we shall be brief. Once the array methods are presented, the reader will then be able to recognize the similarities between the RLS and Kalman filtering domains (by simply working out Probs. VIII.12 and VIII.13); this conclusion is of course expected in view of the already established equivalence result of Sec. 31.2. Thus, refer to the material on Kalman filtering in Chapter 7 and consider again the standard state-space model (7.8)-(7.10):¹⁷

$$\begin{aligned}
x_{i+1} &= F_i x_i + G_i n_i, \quad i \ge 0\\
y_i &= H_i x_i + v_i
\end{aligned}$$

$$E\begin{bmatrix}n_i\\ v_i\\ x_0\\ 1\end{bmatrix}\begin{bmatrix}n_j\\ v_j\\ x_0\end{bmatrix}^* = \begin{bmatrix}Q_i\delta_{ij} & 0 & 0\\ 0 & R_i\delta_{ij} & 0\\ 0 & 0 & \Pi_0\\ 0 & 0 & 0\end{bmatrix} \qquad (35.28)$$

The Kalman filter recursions, in covariance form, for computing the innovations and, subsequently, predicting the state vector are given by (cf. Alg. 7.1):¹⁸

$$\begin{aligned}
R_{e,i} &= R_i + H_i P_{i|i-1}H_i^*\\
K_{p,i} &= F_i P_{i|i-1}H_i^* R_{e,i}^{-1}\\
\nu_i &= y_i - H_i\hat{x}_{i|i-1}\\
\hat{x}_{i+1|i} &= F_i\hat{x}_{i|i-1} + K_{p,i}\nu_i\\
P_{i+1|i} &= F_i P_{i|i-1}F_i^* + G_i Q_i G_i^* - K_{p,i}R_{e,i}K_{p,i}^*
\end{aligned} \qquad (35.29)$$

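with initial conditions $\hat{x}_{0|-1} = 0$ and $P_{0|-1} = \Pi_0$. The time- and measurement-update forms of the Kalman filter are similarly given by (cf. Alg. 7.2):

$$\begin{aligned}
R_{e,i} &= R_i + H_i P_{i|i-1}H_i^*\\
K_{f,i} &= P_{i|i-1}H_i^* R_{e,i}^{-1}\\
\nu_i &= y_i - H_i\hat{x}_{i|i-1}\\
\hat{x}_{i|i} &= \hat{x}_{i|i-1} + K_{f,i}\nu_i\\
\hat{x}_{i+1|i} &= F_i\hat{x}_{i|i}\\
P_{i|i} &= P_{i|i-1} - P_{i|i-1}H_i^* R_{e,i}^{-1}H_i P_{i|i-1}\\
P_{i+1|i} &= F_i P_{i|i}F_i^* + G_i Q_i G_i^*
\end{aligned} \qquad (35.30)$$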

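Time-update array form. An array algorithm is evident for the time-update problem. Indeed, from the identity $P_{i+1|i} = F_i P_{i|i}F_i^* + G_i Q_i G_i^*$, we can express $P_{i+1|i}$ as

$$P_{i+1|i} = \begin{bmatrix}F_i P_{i|i}^{1/2} & G_i Q_i^{1/2}\end{bmatrix}\begin{bmatrix}F_i P_{i|i}^{1/2} & G_i Q_i^{1/2}\end{bmatrix}^*$$

This suggests the following array construction. Let $\Theta_i^{tu}$ be any unitary matrix that triangularizes the pre-array shown below:

$$\begin{bmatrix}F_i P_{i|i}^{1/2} & G_i Q_i^{1/2}\end{bmatrix}\Theta_i^{tu} = \begin{bmatrix}X & 0\end{bmatrix}$$

where $X$ is lower-triangular with positive diagonal entries. Then, by comparing entries on both sides of the squared identity,

$$\begin{bmatrix}F_i P_{i|i}^{1/2} & G_i Q_i^{1/2}\end{bmatrix}\underbrace{\Theta_i^{tu}\Theta_i^{tu*}}_{=I}\begin{bmatrix}F_i P_{i|i}^{1/2} & G_i Q_i^{1/2}\end{bmatrix}^* = \begin{bmatrix}X & 0\end{bmatrix}\begin{bmatrix}X & 0\end{bmatrix}^*$$

we conclude that $XX^* = P_{i+1|i}$ so that $X = P_{i+1|i}^{1/2}$. In summary, we are led to the following time-update array form:

$$\begin{bmatrix}F_i P_{i|i}^{1/2} & G_i Q_i^{1/2}\end{bmatrix}\Theta_i^{tu} = \begin{bmatrix}P_{i+1|i}^{1/2} & 0\end{bmatrix} \qquad (35.31)$$

where $\Theta_i^{tu}$ is any unitary matrix that triangularizes the pre-array.

¹⁷ We shall assume, without loss of generality, that $S_i = 0$, i.e., that the processes $\{n_i, v_i\}$ are uncorrelated. This case is sufficient for our purposes here.

¹⁸ For convenience of exposition, we are denoting the innovations variable of the Kalman filter by $\nu_i$, as opposed to the symbol $e_i$ used in Chapter 7. This is in order to avoid a conflict of notation with the symbol $e(i)$ used to denote the output error of RLS.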


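Measurement-update array form. For the measurement-update problem, we choose any unitary matrix $\Theta_i^{mu}$ that triangularizes the pre-array below:

$$\begin{bmatrix}R_i^{1/2} & H_i P_{i|i-1}^{1/2}\\ 0 & P_{i|i-1}^{1/2}\end{bmatrix}\Theta_i^{mu} = \begin{bmatrix}X & 0\\ Y & Z\end{bmatrix}$$

where $X$ and $Z$ are lower triangular with positive diagonal entries. Then, by comparing entries on both sides of the squared identity:

$$\begin{bmatrix}R_i^{1/2} & H_i P_{i|i-1}^{1/2}\\ 0 & P_{i|i-1}^{1/2}\end{bmatrix}\underbrace{\Theta_i^{mu}\Theta_i^{mu*}}_{=I}\begin{bmatrix}R_i^{1/2} & H_i P_{i|i-1}^{1/2}\\ 0 & P_{i|i-1}^{1/2}\end{bmatrix}^* = \begin{bmatrix}X & 0\\ Y & Z\end{bmatrix}\begin{bmatrix}X & 0\\ Y & Z\end{bmatrix}^*$$

we arrive at the identifications shown in the following measurement-update array form:

$$\begin{bmatrix}R_i^{1/2} & H_i P_{i|i-1}^{1/2}\\ 0 & P_{i|i-1}^{1/2}\end{bmatrix}\Theta_i^{mu} = \begin{bmatrix}R_{e,i}^{1/2} & 0\\ \bar{K}_{f,i} & P_{i|i}^{1/2}\end{bmatrix} \qquad (35.32)$$

where $\bar{K}_{f,i} = K_{f,i}R_{e,i}^{1/2}$.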
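The time- and measurement-update array forms (35.31) and (35.32) can be checked against the explicit recursions (35.30). The sketch below assumes real-valued, randomly generated model matrices; the helper `triangularize` (based on a NumPy QR factorization), the dimensions, and the variable names are all illustrative assumptions.

```python
import numpy as np

def triangularize(A):
    """A @ Theta = [L 0], L lower triangular with nonnegative diagonal
    (real A with no more rows than columns)."""
    Q, R = np.linalg.qr(A.T, mode="complete")      # A.T = Q R  =>  A Q = R.T
    signs = np.ones(A.shape[1])
    signs[:A.shape[0]] = np.where(np.diag(R) < 0, -1.0, 1.0)
    return A @ (Q * signs)

rng = np.random.default_rng(3)
n, p, m = 3, 2, 2                                  # state, output, noise dimensions

def rand_pd(k):
    B = rng.standard_normal((k, k))
    return B @ B.T + k * np.eye(k)                 # random positive-definite matrix

F, G, H = (rng.standard_normal(s) for s in [(n, n), (n, m), (p, n)])
Q_n, R_n, P_pred = rand_pd(m), rand_pd(p), rand_pd(n)   # Q_i, R_i, P_{i|i-1}
chol = np.linalg.cholesky

# Measurement-update array (35.32)
pre_mu = np.block([[chol(R_n), H @ chol(P_pred)],
                   [np.zeros((n, p)), chol(P_pred)]])
post_mu = triangularize(pre_mu)
Re_half, Kf_bar, P_filt_half = post_mu[:p, :p], post_mu[p:, :p], post_mu[p:, p:]

# Time-update array (35.31)
post_tu = triangularize(np.hstack([F @ P_filt_half, G @ chol(Q_n)]))
P_next_half = post_tu[:, :n]

# Explicit recursions (35.30) for comparison
Re = R_n + H @ P_pred @ H.T
Kf = P_pred @ H.T @ np.linalg.inv(Re)
P_filt = P_pred - Kf @ Re @ Kf.T
P_next = F @ P_filt @ F.T + G @ Q_n @ G.T
```

Squaring the blocks read from the two post-arrays reproduces $R_{e,i}$, $P_{i|i}$, and $P_{i+1|i}$ without ever forming a difference of covariance matrices.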


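Array covariance form. By combining the just derived array forms for the measurement- and time-update steps we obtain the following array form for the covariance filter, also known as the square-root form,

$$\begin{bmatrix}R_i^{1/2} & H_i P_{i|i-1}^{1/2} & 0\\ 0 & F_i P_{i|i-1}^{1/2} & G_i Q_i^{1/2}\end{bmatrix}\Theta_i = \begin{bmatrix}R_{e,i}^{1/2} & 0 & 0\\ \bar{K}_{p,i} & P_{i+1|i}^{1/2} & 0\end{bmatrix} \qquad (35.33)$$

where $\bar{K}_{p,i} = K_{p,i}R_{e,i}^{1/2}$ and $\Theta_i$ is any unitary matrix that triangularizes the pre-array. Alternatively, we can deduce the same algorithm as follows. Let $\Theta_i$ be any unitary matrix that triangularizes the pre-array shown below:

$$\begin{bmatrix}R_i^{1/2} & H_i P_{i|i-1}^{1/2} & 0\\ 0 & F_i P_{i|i-1}^{1/2} & G_i Q_i^{1/2}\end{bmatrix}\Theta_i = \begin{bmatrix}X & 0 & 0\\ Y & Z & 0\end{bmatrix}$$

where $X$ and $Z$ are lower triangular with positive diagonal entries. Then, by comparing terms on both sides of the squared identity

$$\begin{bmatrix}R_i^{1/2} & H_i P_{i|i-1}^{1/2} & 0\\ 0 & F_i P_{i|i-1}^{1/2} & G_i Q_i^{1/2}\end{bmatrix}\underbrace{\Theta_i\Theta_i^*}_{=I}\begin{bmatrix}R_i^{1/2} & H_i P_{i|i-1}^{1/2} & 0\\ 0 & F_i P_{i|i-1}^{1/2} & G_i Q_i^{1/2}\end{bmatrix}^* = \begin{bmatrix}X & 0 & 0\\ Y & Z & 0\end{bmatrix}\begin{bmatrix}X & 0 & 0\\ Y & Z & 0\end{bmatrix}^*$$

we arrive at the aforementioned array covariance form.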


Information forms of the Kalman filter. Starting from the covariance and time- and measurement-update forms (35.29) and (35.30) of the Kalman filter, some algebra will show that, when $F_i$ is invertible, the following forms also hold:

• Measurement-update information form:

$$\begin{aligned}
P_{i|i}^{-1} &= P_{i|i-1}^{-1} + H_i^* R_i^{-1}H_i\\
P_{i|i}^{-1}\hat{x}_{i|i} &= P_{i|i-1}^{-1}\hat{x}_{i|i-1} + H_i^* R_i^{-1}y_i
\end{aligned}$$


* Time-update information form: 


Big = o Fg 
Re, = Q4GiE- MFG 
Ki, z= EU iud uc cm 
Pi = FÉCPSE S -KQBnEKS Paci 


e Recursion for the inverse Riccati variable: 


Pu = F PaF + ROAR HQ — Kp, RisKei, Phl = Tp" 
These variants are called information forms since they propagate the inverses of the error covariance matrices, $\{P_{i|i-1}^{-1}, P_{i|i}^{-1}\}$. We can devise array formulations for them as well. The arguments are similar to what we have done so far and, therefore, we shall only state the final recursions; their validity can be verified by simply "squaring" both sides of the corresponding array equations and comparing terms.

Let $\Phi_{i|i-1} = P_{i|i-1}^{-1}$ and $\Phi_{i|i} = P_{i|i}^{-1}$, and introduce their Cholesky factorizations,

$$\Phi_{i|i-1} = \Phi_{i|i-1}^{1/2}\Phi_{i|i-1}^{*/2}, \qquad \Phi_{i|i} = \Phi_{i|i}^{1/2}\Phi_{i|i}^{*/2}$$

where $\{\Phi_{i|i-1}^{1/2}, \Phi_{i|i}^{1/2}\}$ are lower triangular with positive diagonal entries.


Measurement-update information array form. 


euam. wR? | $1945 VIRG” l 
where Q7"” is any unitary matrix that lower triangularizes the pre-array. 
Time-update information array form 
a" quy ey ef = an : (35.35) 
0 Fee | | PBE aV" e l 


where Of" is any unitary rotation that triangularizes the pre-array and generates a lower triangular 
factor vii ? with positive diagonal entries. Here, V; = Q7 1 + Gt A,G,. The combination of (35.34) 
and (35.35) is referred to as the square-root information filter (SRIF). 


Information array form 


Fre? RU HEROMT (*) Dilin 0 


i i|i-1l 


$a, i £u ©: = | (+) Pia viz; (35.36) 


* 1/2 1/2 
0 () Koa, Z 


where ©; is any unitary rotation that lower triangularizes the pre-array. Moreover, zi 7 isa square- 
root factor for Rz 1, i.e., RA - z / “Zr /? and the “(*)” notation indicates “don’t care” entries. 


e,i’ 


Summary and Notes 


The chapters in this part developed three array variants for recursive least-squares solutions: in- 
verse QR, QR, and extended QR (also known as square-root RLS, square-root information RLS, and 
extended square-root information RLS, respectively). 


SUMMARY OF MAIN RESULTS 


1. Array methods are based on transforming a pre-array of numbers into a post-array of num- 
bers by means of unitary transformations. Such transformations are easy to implement as 
a sequence of elementary rotations or reflections (e.g., as Givens rotations or Householder 
reflections). 


2. Array methods are self-contained in that quantities that are needed to form the pre-array are 
propagated in the post-array. 

3. The variables involved in array methods are usually square-root factors whose entries assume 
values within smaller dynamic ranges. As a result, array methods are more reliable and have 
better numerical properties in finite-precision arithmetic than a direct RLS implementation — 
see, e.g., the computer project at the end of this part. 


4. The array variants developed in this part have the same computational complexity as RLS, namely, $O(M^2)$ operations per iteration.


BIBLIOGRAPHIC NOTES 


QR methods. It is generally accepted that the QR method for solving and updating least-squares solutions is among the most reliable procedures for finite word-length implementations. The origin of the method can be traced back to the works of Householder (1953, pp. 72-73), Golub (1965), and Businger and Golub (1965) on least-squares problems; see Probs. VII.12 and VII.32 for a description of the QR method as usually derived in works on computational linear algebra.

In the context of adaptive filtering, simulation studies in Yang and Böhme (1992) suggest that the QR method of Sec. 35.2 is reliable numerically for $\lambda < 1$ and can diverge for $\lambda = 1$. This is because the mechanism for the propagation of a single numerical error is exponentially stable for $\lambda < 1$. Further experimental evidence for the numerical stability of the QR method can be found in Ward, Hargrave, and McWhirter (1986). However, as mentioned in the concluding remarks of Part VII (Least-Squares Methods), such conclusions on the numerical stability of RLS-type algorithms from single-error propagation models should be interpreted with care.

The extended QR method of Sec. 35.3 was developed by Yang and Böhme (1992), and it also appears in Sayed (1992) and Sayed and Kailath (1994b). Unfortunately, the algorithm can face numerical difficulties in finite word-length implementations since its pre- and post-arrays involve both $\Phi_i^{1/2}$ and its inverse.


Systolic implementation. Array methods can be parallelized in view of the fact that the rotations can be applied simultaneously to all rows in a pre-array. In particular, for the QR method of Sec. 35.2, a systolic array implementation was developed by McWhirter (1983) in the form of a triangular array of processors. Related descriptions can be found in McWhirter and Proudler (1993), Shepherd and McWhirter (1993), Sayed, Lev-Ari, and Kailath (1994), and Haykin (1996,2002). Parallel and systolic array implementations for the inverse QR algorithm of Sec. 35.1 can be found in Pan and Plemmons (1989) and Alexander and Ghirnikar (1993).


Basis rotation. The idea of using the basis rotation result of Lemma 33.1 as a tool for deriving array RLS methods is due to Sayed and Kailath (1994b). The result of the lemma provides a convenient way for motivating and describing array methods (see, e.g., the presentation in Haykin (1996,2000) and also Manolakis, Ingle, and Kogon (2000)).


Array methods in estimation. The idea of using array algorithms to propagate square-root factors, rather than the matrices themselves, dates back to the mid 1960s in works on Kalman filtering by Potter and Stern (1963). The need for introducing such array methods was sparked by the necessity to develop reliable Kalman filtering implementations for precision approach and moon landing systems. However, this initial work was limited to the measurement-update step of the Kalman filter (cf. Chapter 7), and extensive subsequent developments followed that extended array methods to different forms of Kalman filtering implementations (for both filtering and smoothing applications). Among these earlier works we may mention those of Dyer and McReynolds (1969), Hanson and Lawson (1969), Schmidt (1970), Bierman (1974,1977), and Morf and Kailath (1975). The last two references are notable for their generality, with the work by Morf and Kailath (1975) being the closest to the array descriptions in this chapter; see App. 35.A and Probs. VIII.12 and VIII.13.


RLS and Kalman filtering. Since RLS can be regarded as a special case of the Kalman filter for a special state-space model, as was shown in Sec. 31.2, there should also exist a correspondence between the RLS array algorithms of this chapter and the Kalman filtering array algorithms. This correspondence was developed by Sayed and Kailath (1993,1994b) and it is detailed in Chapter 31, Apps. 35.A and 37.A, and Probs. VIII.12 and VIII.13; it is also covered in Haykin (1996,2000).


Paige's method. There have been several interesting works in numerical linear algebra on stable array methods for recursively updating least-squares solutions. These methods are relevant to the adaptive filtering context, and most notable among them is an array algorithm developed by Paige (1985) as a result of studies by Paige (1979a,1979b) and Paige and Saunders (1977). Paige's form is useful when dealing with ill-conditioned data. Rather than propagate a square-root factor of $P_i$, the algorithm propagates factors of $P_i^{1/2}$ itself. In this way, it avoids matrix products while forming the pre-arrays and leads to improved (in fact, stable) numerical performance albeit at some increased computational cost. A description of the algorithm appears in the textbook by Kailath, Sayed, and Hassibi (2000, Chapter 12). Paige proved that his algorithm is "backward" stable, which is a desirable feature. What this means is the following.

A numerical algorithm for solving a linear system of equations $Ax = b$ is said to be backward stable if the computed solution $\hat{x}$ can be shown to be the exact solution of a slightly perturbed system, namely, $\hat{x}$ satisfies $(A + \delta A)\hat{x} = (b + \delta b)$ for small perturbations $\{\delta A, \delta b\}$.


Problems and Computer Projects 


PROBLEMS 


Problem VIII.1 (Rank-one modifications) Let $X = I - \alpha xx^*$, where $x$ is a column vector, $\alpha$ is a scalar, and $I$ is the identity matrix.

(a) Show that $XX^* = I - \beta xx^*$, for some real number $\beta$.

(b) Consider a matrix $Y$ of the form $Y = I - \beta xx^*$, for some real scalar $\beta$. For what condition on $\beta$ is $Y$ positive semi-definite? When this is the case, show that $Y$ admits a Hermitian square-root factor of the form $Y^{1/2} = I - \alpha xx^*$ for some real scalar $\alpha$.


Problem VIII.2 (Basis rotation) In this problem we provide another proof for Lemma 33.1 by 
resorting to the QR decomposition of a matrix rather than its singular value decomposition. The 
SVD proof given in the text is more general since the argument in this problem assumes that the 
matrices A and B have full rank. 

Introduce the QR decompositions

$$A^* = Q_A \begin{bmatrix} R_A \\ 0 \end{bmatrix}, \qquad B^* = Q_B \begin{bmatrix} R_B \\ 0 \end{bmatrix}$$

where $Q_A$ and $Q_B$ are $M \times M$ unitary matrices, and $R_A$ and $R_B$ are $n \times n$ upper triangular matrices
with positive diagonal entries (due to the full rank assumption on $A$ and $B$).

(a) Show that $AA^* = R_A^* R_A = R_B^* R_B$.

(b) Conclude, by uniqueness of the Cholesky factorization (cf. Sec. B.3), that $R_A = R_B$. Complete
the argument to show that the matrix $\Theta = Q_B Q_A^*$ is unitary and maps $B$ to $A$.


Problem VIII.3 (Sample covariance matrix) Let $\Phi_i = \sum_{j=0}^{i} \lambda^{i-j} u_j^* u_j$, where the $u_j$ are $1 \times M$
row vectors and $0 \ll \lambda \le 1$. The matrix $\Phi_i$ can have full rank or it may be rank deficient.
Assume its rank is $r \le M$. Let $\Phi_i^{1/2}$ denote an $M \times r$ full-rank square-root factor, i.e., $\Phi_i^{1/2}$ has
rank $r$ and satisfies $\Phi_i^{1/2}\Phi_i^{*/2} = \Phi_i$. Show that $u_i^*$ belongs to the column span of $\Phi_i^{1/2}$.

Problem VIII.4 (Rank-three update) Consider a recursion of the form

$$\Phi_i = \lambda \Phi_{i-1} + a^*a + b^*b + c^*c$$

where $\lambda$ is a positive scalar and $\{a, b, c\}$ are row vectors.

(a) Let $P_i = \Phi_i^{-1}$. Find a procedure that updates $P_{i-1}$ to $P_i$ and that does not require any matrix
inversion; only scalar inversions are allowed.

(b) Derive an array algorithm for updating $\Phi_{i-1}^{1/2}$ to $\Phi_i^{1/2}$.

(c) Derive an array algorithm for updating $P_{i-1}^{1/2}$ to $P_i^{1/2}$.


Problem VIII.5 (QR method for a modified cost) Consider the formulation of Prob. VII.30.
Repeat the arguments of Sec. 35.2 to derive the following QR-based implementation. Start with
$\Phi_{-1}^{1/2} = \Pi^{1/2}$, $q_{-1} = 0$, and repeat for $i \ge 0$.

in the last row of the post-array. The entries in the post-array will then correspond to
PET a 9i? 0 
Padia dü)|e-]| d e(Gwv^) 
0 1 u$; y? (i) 


2. Obtain $w_i$ by solving the triangular linear system of equations $\Phi_i^{*/2} w_i = q_i$, where the
quantities $\{\Phi_i^{1/2}, q_i\}$ are read from the post-array.


Problem VIII.6 (Rotation matrix for QR algorithm) Refer to the discussion in Sec. 35.2 on the
QR algorithm and recall that the purpose of the transformation $\Theta_i$ in (35.17) is to perform a
transformation of the form (assuming $M = 3$):


Aala 


One way to implement $\Theta_i$ is via a sequence of three elementary Givens rotations in order to annihilate
the three entries of $u_i^*$, one at a time. Now since, by assumption, the diagonal entries of $\Phi_{i-1}^{1/2}$ are
positive, we have that the diagonal entries of the triangular matrix in the pre-array are positive.
Recall further that we desire a post-array in (35.8) having a $\Phi_i^{1/2}$ with positive diagonal entries as
well. Therefore, in view of the remark following Lemma 34.2, the diagonal entries of the individual
Givens rotations will be positive. Use this fact to conclude that the rightmost diagonal entry of $\Theta_i$ is
positive.

Problem VIII.7 (Block inverse QR) Refer to the block RLS algorithm of Prob. VII.36. Follow
the derivation in Sec. 35.1 to derive the following array variant for it.

Let $w_N$ denote the solution of the least-squares problem


N 
min le — 4)'l(v — 0) + yd; — Ujw)' R; (d; - uso) 
j=0 
Then wy can be recursively computed as follows. Let X = II^! and introduce the Cholesky fac- 
torization © = X:1/2Y;*/2, where X? is lower triangular with positive-diagonal entries. Then start 
with w-; = uU, pi? = Y? and repeat for i > 0. 
1. Find a unitary matrix ©, that lower triangularizes the pre-array shown below and generates a 
post-array with positive diagonal entries. Then the entries in the post-array will correspond to 


RP WES o| xw 0 
0 pi LS G,Ci/? pi? 


where R; = R; / ? R17 and T7! = C] / al /2 denote the Cholesky decompositions of R; 
and I’; ', respectively. 
i 1/2 1/2) 7! 
2. Update the weight vector as wi = wi-1 + [ec? | |c; ] [di — Uiwi-1], where the 
quantities (G; eu 50! /2) are read from the post-array. 

Problem VIII.8 (Block QR) Refer to the statement of the block RLS algorithm in Prob. VII.37.
Follow the derivation in Sec. 35.2 to derive the following array variant for it.

Let $w_N$ denote the solution of the least-squares problem


N 
min (w — &)*H(w — 0) + CE — Ujw)* R; (d; - use| 


j-0 


and let d; = d; — U;w. Then ww can be recursively computed as follows. Start with $i i =i, 
q-1 = 0, and repeat for i > 0. 


1. Find a unitary matrix ©; that lower triangularizes the pre-array shown below and generates a 
post-array with positive diagonal entries in the lower triangular factors $i /? and D 7? in the 
post-array. The entries in the post-array will then correspond to 


Di UMSO 917 0 
da RP |e = qt eri? 
0 Ry“? Rue"? p? 


where R; = RY ? gi ? and T; =T} / pr /? denote the Cholesky decompositions of R; and 
T, respectively. 

2. Obtain w; by solving the triangular linear system of equations ©; É [wi — w] = qi, where the 
quantities {*/?, g;} are read from the post-array. 


Problem VIII.9 (Block extended QR) Refer to the statement of the block RLS algorithm in Prob. VII.37.
Follow the derivation in Sec. 35.3 to derive the following array variant for it.

Let $w_N$ denote the solution of the least-squares problem


N 
min le — i5)'I(w — wo) + y» — Ujw)' R; ! (dj — Ujw) 
j-0 


and let d; = d; — Ujw. Then wy can be recursively computed as follows. Start with !/2 = I11/2, 
$71? =T1-*/?, q- = 0, and repeat for i > 0. 
1. Find a unitary matrix ©, that lower triangularizes the pre-array shown below and generates a 
post-array with positive diagonal entries in al /? ang ri ? in the post-array. The entries in the 
post-array will then correspond to 


a UR a? 0 
Gar GR? gg qi ar? 
0 RU : RU"? pra 
quu. 0 o"? Gr? 


where Ri = Ri/?R7/? and T; = F'1"?T7/? denote the Cholesky decompositions of R; and 
Ti, respectively. 


2. Update the weight vector as wi = wi-1 — (-a.r7*/?) . (eri, y. 


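Problem VIII.10 (Numerical example) Transform the pre-array

$$A = \begin{bmatrix} 0.27 & 0.14 & 0.14 & 0.35 \\ 0.22 & 0.36 & 0.21 & 0.25 \\ -0.22 & 0.28 & -0.14 & 0.30 \end{bmatrix} \quad \text{to the form} \quad \begin{bmatrix} \times & 0 & 0 & 0 \\ \times & \times & 0 & 0 \\ \times & \times & \times & 0 \end{bmatrix}$$

by using (i) Givens rotations, (ii) Householder transformations, and (iii) a mixture of Givens and
Householder transformations.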


Problem VIII.11 (Effect of perturbations) Refer to the discussion in App. 35.A on Kalman filtering,
and consider a one-dimensional model of the form $x(i+1) = a\,x(i)$ and $y(i) = x(i) + v(i)$
with $a < 1$, $E|x(0)|^2 = 1$, and $E\,v(i)v^*(j) = \delta_{ij}$.

(a) Verify that, for this model, the covariance form and the information form recursions for the
Riccati variable are given by

$$p(i+1|i) = \frac{a^2\, p(i|i-1)}{1 + p(i|i-1)} \quad \text{and} \quad p^{-1}(i+1|i) = \frac{p^{-1}(i|i-1) + 1}{a^2}$$

with initial condition $p(0|-1) = 1$; here $p(i|i-1)$ is a scalar variable.

(b) Assume that at iteration $i_o$ a perturbation is introduced into $p(i_o|i_o-1)$, say due to numerical
errors. No further perturbations are introduced afterwards. Let $\bar{p}(i|i-1)$ denote the Riccati
variable that results for $i > i_o$ from

$$\bar{p}(i+1|i) = \frac{a^2\, \bar{p}(i|i-1)}{1 + \bar{p}(i|i-1)}, \qquad \bar{p}(i_o|i_o-1), \quad i \ge i_o$$

Let also $\bar{p}^{-1}(i|i-1)$ denote the inverse Riccati variable that results for $i > i_o$ from

$$\bar{p}^{-1}(i+1|i) = \frac{\bar{p}^{-1}(i|i-1) + 1}{a^2}, \qquad \bar{p}^{-1}(i_o|i_o-1), \quad i \ge i_o$$

Show that $\bar{p}^{-1}(i+1|i) - p^{-1}(i+1|i) = [\bar{p}^{-1}(i|i-1) - p^{-1}(i|i-1)]/a^2$, whereas

$$\bar{p}(i+1|i) - p(i+1|i) = \frac{a^2\,[\bar{p}(i|i-1) - p(i|i-1)]}{[1 + \bar{p}(i|i-1)]\,[1 + p(i|i-1)]}$$

(c) Conclude that the recursion for $p(i|i-1)$ recovers from the perturbation, while the error
$\bar{p}^{-1}(i|i-1) - p^{-1}(i|i-1)$ grows unbounded!


Remark. This problem shows how different (but mathematically equivalent) filter implementations can behave 
quite differently under perturbations. 
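
The contrast described in the remark can be observed numerically. The following is a minimal sketch, assuming the scalar recursions $p(i+1|i) = a^2 p(i|i-1)/[1 + p(i|i-1)]$ and $p^{-1}(i+1|i) = [p^{-1}(i|i-1) + 1]/a^2$ that follow from the model of this problem; the function name `perturbation_errors` and its default values are illustrative choices and not part of the text.

```python
def perturbation_errors(a=0.9, i0=5, steps=50, delta=1e-6):
    """Propagate nominal and perturbed Riccati variables in both forms.

    Returns (covariance-form error, information-form error, initial
    information-form error) after `steps` iterations beyond time i0.
    """
    p = 1.0                                   # p(0|-1) = 1
    for _ in range(i0):
        p = a * a * p / (1.0 + p)             # unperturbed covariance form
    p_bar = p + delta                         # perturbation introduced at time i0
    q, q_bar = 1.0 / p, 1.0 / p_bar           # information-form variables
    info_err0 = abs(q_bar - q)
    for _ in range(steps):
        p = a * a * p / (1.0 + p)
        p_bar = a * a * p_bar / (1.0 + p_bar)
        q = (q + 1.0) / (a * a)
        q_bar = (q_bar + 1.0) / (a * a)
    return abs(p_bar - p), abs(q_bar - q), info_err0


cov_err, info_err, info_err0 = perturbation_errors()
print(cov_err, info_err, info_err0)
```

With $a < 1$, the covariance-form error is multiplied at each step by a factor smaller than $a^2$, while the information-form error is multiplied by $1/a^2 > 1$.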


Problem VIII.12 (Inverse QR and array covariance form) In Sec. 31.2 we showed that the 
exponentially weighted RLS algorithm can be obtained via equivalence to the Kalman filter as fol- 
lows. We start from the state-space model (31.8) and write down the Kalman recursions for estimat- 
ing its state variable. Then we translate the Kalman variables into the RLS variables by using the 
correspondences summarized in Table 31.2. In this problem, as well as in Prob. VIII.13, we want to 
apply this same procedure to the array variants of the Kalman filter that are described in App. 35.A. 
In this way, we would be able to recover the array variants of RLS as special cases. Thus consider
model (31.8), which is a special case of the state-space model (35.28) with $R_i = 1$, $G_i = 0$, and
$Q_i = 0$.


(a) Use (35.33) to verify that the corresponding covariance array form is given by 


1 wi Pay Bee ri^) 0 
0 AB i kp,ire!? (i) PI 


where ©; is any unitary matrix that triangularizes the pre-array. Moreover, 
Bina AU + hpi), v(i) = y(i) - usas 
(b) Use the correspondences from Table 11.1 and replace the Kalman variables 
(Pii-1; re (2), kpi, Pi+ijis eiii v ()) 


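by the corresponding RLS variables. Verify that the array form of part (a) would then reduce
to the inverse QR array of Alg. 35.1, i.e.,

$$\begin{bmatrix} 1 & \lambda^{-1/2} u_i P_{i-1}^{1/2} \\ 0 & \lambda^{-1/2} P_{i-1}^{1/2} \end{bmatrix} \Theta_i = \begin{bmatrix} \gamma^{-1/2}(i) & 0 \\ g_i \gamma^{-1/2}(i) & P_i^{1/2} \end{bmatrix}$$

while the state estimator leads to

$$w_i = w_{i-1} + \left[ g_i \gamma^{-1/2}(i) \right] \left[ \gamma^{-1/2}(i) \right]^{-1} \left[ d(i) - u_i w_{i-1} \right]$$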


Remark. Problems VIII.12, VIII.13, and later Prob. IX.16, clarify the relation between RLS array algorithms and
their Kalman filtering counterparts, following the work of Sayed and Kailath (1994b).


Problem VIII.13 (Array QR and Kalman information form) We follow the procedure of Prob. VIII.12 
and relate now the QR array form of RLS (cf. Alg. 35.2) to the array information form (35.36) of the 
Kalman filter. So consider again model (31.8). 


(a) Use (35.36) to verify that the corresponding information array form is given by 


Mgt? Mr Si, 0 " 
ea, y'( |9i- eiui; v"'(i)re : (i) 

* 1/2 —1/2,. 

0 1 ai re (i) 


where ©, is any unitary rotation that lower triangularizes the pre-array and ©;);_1 = P; : 


iji-1° 
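(b) Use the correspondences from Table 11.1 to verify that the array form of part (a) reduces to
the QR array form of Alg. 35.2, i.e.,

$$\begin{bmatrix} \lambda^{1/2} \Phi_{i-1}^{1/2} & u_i^* \\ \lambda^{1/2} q_{i-1}^* & d^*(i) \\ 0 & 1 \end{bmatrix} \Theta_i = \begin{bmatrix} \Phi_i^{1/2} & 0 \\ q_i^* & e_a^*(i)\gamma^{1/2}(i) \\ u_i \Phi_i^{-*/2} & \gamma^{1/2}(i) \end{bmatrix}$$

with $w_i$ obtained by solving $\Phi_i^{*/2} w_i = q_i$ (assume, for simplicity, $\bar{w} = 0$).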


COMPUTER PROJECT 


Project VIII.1 (Performance of array implementations in finite precision) The purpose of 
this project is to compare the performance of RLS and one of its array variants in finite-precision. 
Refer to the channel estimation application shown in Fig. VIII.1. 


FIGURE VIII.1 Adaptive configuration for channel estimation.

Generate five random samples of a channel impulse response sequence and normalize the norm
of this impulse response to unity. Feed unit-variance Gaussian input data through the channel and
add Gaussian noise to its output. Set the noise power at 30 dB below the input signal power. Train
an adaptive filter for $N = 200$ iterations using RLS and the QR array method of Alg. 35.2. Use
$\lambda = 0.995$, $\Pi = 1 \times 10^{-6} I$ and $\bar{w} = 0$. Assume also that a finite-precision implementation is used
with $B_s$ bits for signals and $B_c$ bits for coefficients (including the sign bit in both cases). In order
to simulate the quantized behavior of the filters, you may use a routine quantize.m. The routine
receives as input two parameters: a real number $x$ and the total number of bits, $B$ (including the sign
bit), for its quantized representation. The routine then returns a real number that corresponds to the
quantized value of $x$. This value is determined as follows. With $B$ total bits, the largest integers that
can be represented are

$$\pm (2^{B-1} - 1)$$

If $x$ exceeds these extreme values then its quantized representation is taken as either one of them,
depending on the sign of $x$. If, on the other hand, $x$ falls within the interval

$$x \in \left( -2^{B-1} + 1,\; 2^{B-1} - 1 \right)$$

then the routine determines how many bits are needed to represent the integer part of $x$, and the
remaining bits are used to represent the fractional part of $x$.
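
The behavior described for the quantization routine can be sketched as follows. This is only one possible reading of the description and not the routine quantize.m itself: it assumes a sign-magnitude format, rounding to the nearest level, and clipping of the rounded code to the $B - 1$ magnitude bits. The Python function name `quantize` and these rounding details are assumptions.

```python
import math


def quantize(x, total_bits):
    """Quantize a real number x to total_bits bits (sign bit included).

    Assumed scheme: one sign bit; as many magnitude bits as needed for the
    integer part of |x|; the remaining bits hold the fractional part.
    """
    magnitude_bits = total_bits - 1
    max_code = 2 ** magnitude_bits - 1            # largest representable integer
    # bits needed for the integer part of |x| (0 when |x| < 1), capped at B - 1
    int_bits = min(int(abs(x)).bit_length(), magnitude_bits)
    frac_bits = magnitude_bits - int_bits
    # round to the nearest level and saturate at the largest code
    code = min(math.floor(abs(x) * 2 ** frac_bits + 0.5), max_code)
    value = code / 2 ** frac_bits
    return -value if x < 0 else value


print(quantize(0.3, 4), quantize(5.3, 6), quantize(1000.0, 8))
```

In a simulation along the lines of this project, such a function would be applied to every signal value with $B = B_s$ and to every coefficient with $B = B_c$ after each arithmetic operation.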

For each algorithm (RLS and QR array method), generate an ensemble-average learning curve by 
averaging over 100 experiments and for the following choices: 


1. $B_s = B_c = 30$ bits.
2. $B_s = B_c = 25$ bits.
3. $B_s = B_c = 16$ bits.


Hyperbolic Rotations


It is sometimes necessary, especially when deriving fast least-squares algorithms (as we
shall discuss in the next chapter), to employ $J$-unitary (also called hyperbolic) transformations,
as opposed to unitary transformations, in order to annihilate certain entries in a
pre-array of numbers. A $J$-unitary transformation $\Theta$ is one that satisfies

$$\Theta J \Theta^* = \Theta^* J \Theta = J$$

for some signature matrix $J$, i.e., a diagonal matrix with $\pm 1$ entries. The special case $J = I$
corresponds to unitary transformations and was studied in Chapter 34. In this chapter, we
extend the results to the $J$-unitary case, starting with Givens rotations and followed by
Householder transformations.


36.1 HYPERBOLIC GIVENS ROTATIONS 


As in our discussions in Chapter 34, we again distinguish between the cases of real data 
and complex data. 


Real Data 
Thus, consider a $1 \times 2$ real-valued vector $z = [\, a \;\; b \,]$, and assume that we wish to
determine a $2 \times 2$ matrix $\Theta$ that transforms it to the form:

$$[\, a \;\; b \,]\, \Theta = [\, \alpha \;\; 0 \,] \qquad (36.1)$$

for some nonzero real number $\alpha$ to be determined, and where $\Theta$ is required to be hyperbolic,
i.e., it should satisfy

$$\Theta J \Theta^T = \Theta^T J \Theta = J \quad \text{where} \quad J = \mathrm{diag}(1, -1)$$

Unfortunately, and in contrast to the case of orthogonal Givens transformations in Sec. 34.1,
the transformation (36.1) is not always possible. To see this, note from (36.1) that

$$[\, a \;\; b \,]\, \underbrace{\Theta J \Theta^T}_{=J} \begin{bmatrix} a \\ b \end{bmatrix} = [\, \alpha \;\; 0 \,]\, J \begin{bmatrix} \alpha \\ 0 \end{bmatrix}$$

i.e., $a^2 - b^2 = \alpha^2$. Now since $\alpha^2 > 0$, no matter what $\alpha$ is, this means that the
transformation (36.1) is only possible if $|a| > |b|$. When this is not the case, i.e., if $|a| < |b|$, then
we should seek instead a hyperbolic rotation $\Theta$ that transforms $z$ into the alternative form

$$[\, a \;\; b \,]\, \Theta = [\, 0 \;\; \alpha \,] \qquad (36.2)$$

In this case, the transformation (36.2) will guarantee $a^2 - b^2 = -\alpha^2$, which is consistent
with the fact that $a^2 - b^2 < 0$. So let us examine the cases $|a| > |b|$ and $|a| < |b|$ separately.


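$|a| > |b|$ (first entry of the vector is dominant)

In this case, an expression for $\Theta$ that achieves (36.1) is given by

$$\Theta = \frac{1}{\sqrt{1 - \rho^2}} \begin{bmatrix} 1 & -\rho \\ -\rho & 1 \end{bmatrix} \quad \text{where} \quad \rho = \frac{b}{a} \qquad (36.3)$$

It is a straightforward exercise to verify that

$$[\, a \;\; b \,]\, \Theta = \left[\, \pm\sqrt{a^2 - b^2} \;\;\; 0 \,\right]$$

where the sign of the resulting $\alpha$ depends on whether the value of the square-root in the
expression (36.3) for $\Theta$ is chosen to be negative or positive.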

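The reason for the denomination hyperbolic rotation for $\Theta$ can be seen by expressing $\Theta$
in the form

$$\Theta = \begin{bmatrix} \mathrm{ch} & -\mathrm{sh} \\ -\mathrm{sh} & \mathrm{ch} \end{bmatrix}, \qquad \mathrm{ch} = \frac{1}{\sqrt{1 - \rho^2}}, \quad \mathrm{sh} = \frac{\rho}{\sqrt{1 - \rho^2}}$$

in terms of hyperbolic cosine and sine parameters, ch and sh, respectively, with $\rho = b/a$. In this way,
we find that the effect of $\Theta$ is to rotate any point $(x, y)$ in the two-dimensional Euclidean
space along the hyperbola of equation $x^2 - y^2 = a^2 - b^2$; see Fig. 36.1.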


[Figure: plot of the hyperbola $x^2 - y^2 = 1$ for $x \in [-2, 2]$, with horizontal x-axis and vertical y-axis.]

FIGURE 36.1 A hyperbolic rotation in 2-dimensional Euclidean space moves points along the
hyperbola of equation $x^2 - y^2 = a^2 - b^2$. In the figure, $a^2 - b^2 = 1$ and point A is moved into point B.
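
The invariance of $x^2 - y^2$ under a hyperbolic rotation can be checked with a few lines of code. The sketch below assumes a real parameter $\rho$ with $|\rho| < 1$ and uses $\mathrm{ch} = 1/\sqrt{1 - \rho^2}$ and $\mathrm{sh} = \rho/\sqrt{1 - \rho^2}$; the function name `rotate_along_hyperbola` is illustrative only.

```python
import math


def rotate_along_hyperbola(point, rho):
    """Apply [x, y] * [[ch, -sh], [-sh, ch]] for a real rho with |rho| < 1."""
    x, y = point
    ch = 1.0 / math.sqrt(1.0 - rho * rho)   # hyperbolic cosine parameter
    sh = rho * ch                           # hyperbolic sine parameter
    return (x * ch - y * sh, -x * sh + y * ch)


start = (2.0, 1.0)                          # lies on x^2 - y^2 = 3
for rho in (-0.9, -0.3, 0.0, 0.5, 0.8):
    x, y = rotate_along_hyperbola(start, rho)
    print(rho, x, y, x * x - y * y)         # last column stays close to 3
```

Since $\mathrm{ch}^2 - \mathrm{sh}^2 = 1$, every rotated point satisfies the same hyperbola equation as the starting point, up to roundoff.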


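$|b| > |a|$ (second entry of the vector is dominant)

In this second case, an expression for $\Theta$ that achieves (36.2) is given by

$$\Theta = \frac{1}{\sqrt{1 - \rho^2}} \begin{bmatrix} 1 & -\rho \\ -\rho & 1 \end{bmatrix} \quad \text{where} \quad \rho = \frac{a}{b} \qquad (36.4)$$

which now leads to the result

$$[\, a \;\; b \,]\, \Theta = \left[\, 0 \;\;\; \pm\sqrt{b^2 - a^2} \,\right]$$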


In summary, we arrive at the following conclusion. 


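Lemma 36.1 (Real hyperbolic Givens) Consider a $1 \times 2$ vector $[\, a \;\; b \,]$ with real
entries. If $|a| > |b|$, then choose $\Theta$ as in (36.3) to get

$$[\, a \;\; b \,]\, \Theta = \pm\sqrt{a^2 - b^2}\; [\, 1 \;\; 0 \,]$$

If, on the other hand, $|a| < |b|$, then choose $\Theta$ as in (36.4) to get

$$[\, a \;\; b \,]\, \Theta = \pm\sqrt{b^2 - a^2}\; [\, 0 \;\; 1 \,]$$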
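
The lemma translates directly into a small routine. The following is a minimal sketch for real data, assuming the rotation has the form $\Theta = \frac{1}{\sqrt{1-\rho^2}}\begin{bmatrix} 1 & -\rho \\ -\rho & 1 \end{bmatrix}$ with $\rho = b/a$ when $|a| > |b|$ and $\rho = a/b$ when $|a| < |b|$, and taking the positive square root; the helper names `hyperbolic_givens` and `apply_row` are illustrative only.

```python
import math


def hyperbolic_givens(a, b):
    """Return a 2x2 hyperbolic rotation (nested lists) for the real row [a, b]."""
    if abs(a) == abs(b):
        raise ValueError("no hyperbolic rotation exists when |a| = |b|")
    # rho = b/a when the first entry dominates, rho = a/b otherwise
    rho = b / a if abs(a) > abs(b) else a / b
    scale = 1.0 / math.sqrt(1.0 - rho * rho)
    return [[scale, -rho * scale], [-rho * scale, scale]]


def apply_row(v, theta):
    """Compute the row-vector product v * theta."""
    return [v[0] * theta[0][0] + v[1] * theta[1][0],
            v[0] * theta[0][1] + v[1] * theta[1][1]]


print(apply_row([5.0, 3.0], hyperbolic_givens(5.0, 3.0)))   # close to [4, 0]
print(apply_row([3.0, 5.0], hyperbolic_givens(3.0, 5.0)))   # close to [0, 4]
```

With the positive square root, the nonzero entry of the result inherits the sign of the dominant entry of the input vector.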


Complex Data 
More generally, consider a $1 \times 2$ vector $z = [\, a \;\; b \,]$ with possibly complex entries, and
assume that we wish to determine an elementary $2 \times 2$ matrix $\Theta$ that transforms it to either
of the forms (36.1) or (36.2) with a possibly complex-valued $\alpha$, and where $\Theta$ is now required to
satisfy

$$\Theta J \Theta^* = \Theta^* J \Theta = J \quad \text{where} \quad J = \mathrm{diag}(1, -1)$$


We again need to distinguish between two cases. 


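$|a| > |b|$ (first entry of the vector is dominant)

In this case, we choose $\Theta$ as

$$\Theta = \frac{1}{\sqrt{1 - |\rho|^2}} \begin{bmatrix} 1 & -\rho \\ -\rho^* & 1 \end{bmatrix} \quad \text{where} \quad \rho = \frac{b}{a} \qquad (36.5)$$

This choice achieves the transformation (36.1). Specifically, it gives

$$[\, a \;\; b \,]\, \Theta = \left[\, \pm e^{j\phi_a} \sqrt{|a|^2 - |b|^2} \;\;\; 0 \,\right]$$

where $\phi_a$ denotes the phase of $a$.
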
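$|a| < |b|$ (second entry of the vector is dominant)

Now we choose $\Theta$ as

$$\Theta = \frac{1}{\sqrt{1 - |\rho|^2}} \begin{bmatrix} 1 & -\rho \\ -\rho^* & 1 \end{bmatrix} \quad \text{where} \quad \rho = \frac{a^*}{b^*} \qquad (36.6)$$

This choice achieves the transformation (36.2). Specifically, it gives

$$[\, a \;\; b \,]\, \Theta = \left[\, 0 \;\;\; \pm e^{j\phi_b} \sqrt{|b|^2 - |a|^2} \,\right]$$

where $\phi_b$ denotes the phase of $b$. In summary, we are led to the following conclusion.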


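Lemma 36.2 (Complex hyperbolic Givens) Consider a $1 \times 2$ vector $[\, a \;\; b \,]$ with
possibly complex entries. If $|a| > |b|$, then choose $\Theta$ as in (36.5) to get

$$[\, a \;\; b \,]\, \Theta = \pm e^{j\phi_a} \sqrt{|a|^2 - |b|^2}\; [\, 1 \;\; 0 \,]$$

If, on the other hand, $|a| < |b|$, then choose $\Theta$ as in (36.6) to get

$$[\, a \;\; b \,]\, \Theta = \pm e^{j\phi_b} \sqrt{|b|^2 - |a|^2}\; [\, 0 \;\; 1 \,]$$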
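
A complex-valued version can be sketched in the same way. The code below assumes the form $\Theta = \frac{1}{\sqrt{1-|\rho|^2}}\begin{bmatrix} 1 & -\rho \\ -\rho^* & 1 \end{bmatrix}$, with $\rho = b/a$ when $|a| > |b|$ and $\rho = a^*/b^*$ when $|a| < |b|$, and the positive square root; the function names are illustrative and the check values were worked out by hand for this example.

```python
import math


def complex_hyperbolic_givens(a, b):
    """Return a 2x2 J-unitary rotation, J = diag(1, -1), for the row [a, b]."""
    if abs(a) > abs(b):
        rho = b / a                               # first entry dominant
    elif abs(a) < abs(b):
        rho = a.conjugate() / b.conjugate()       # second entry dominant
    else:
        raise ValueError("no hyperbolic rotation exists when |a| = |b|")
    scale = 1.0 / math.sqrt(1.0 - abs(rho) ** 2)
    return [[scale, -rho * scale], [-rho.conjugate() * scale, scale]]


def apply_row(v, theta):
    """Compute the row-vector product v * theta."""
    return [v[0] * theta[0][0] + v[1] * theta[1][0],
            v[0] * theta[0][1] + v[1] * theta[1][1]]


a, b = 3 + 4j, 3j                                 # |a| = 5, |b| = 3
print(apply_row([a, b], complex_hyperbolic_givens(a, b)))   # close to [2.4+3.2j, 0]
```

In this example the surviving entry equals $e^{j\phi_a}\sqrt{|a|^2 - |b|^2} = 4\,(3+4j)/5$, so the phase of the dominant entry is preserved.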


36.2 HYPERBOLIC HOUSEHOLDER TRANSFORMATIONS 


In contrast to hyperbolic Givens rotations, Householder transformations can be used to an- 
nihilate multiple entries in a row vector at once. We describe them below for both cases of 
real and complex data. 


Real Data 
Let $e_0$ denote the leading row basis vector in $n$-dimensional Euclidean space,

$$e_0 = [\, 1 \;\; 0 \;\; \ldots \;\; 0 \,]$$

and consider a $1 \times n$ real-valued vector $z$ with entries $\{z(i),\, i = 0, 1, \ldots, n-1\}$. Assume
that we wish to transform $z$ to the form

$$[\, z(0) \;\; z(1) \;\; \ldots \;\; z(n-1) \,]\, \Theta = \alpha e_0 \qquad (36.7)$$

for some nonzero real scalar $\alpha$ to be determined, and where the transformation $\Theta$ is required
to be both $J$-orthogonal and involutary. By a $J$-orthogonal transformation we
mean one that satisfies

$$\Theta J \Theta^T = \Theta^T J \Theta = J$$

for some given signature matrix $J$ with $\pm 1$ diagonal entries, usually of the form

$$J = (I_p \oplus -I_q), \quad p \ge 1, \; q \ge 1$$

and by an involutary matrix $\Theta$ we mean one that satisfies $\Theta^2 = I$.

Again, and in contrast to the case of orthogonal Householder transformations in Sec. 34.1,
the transformation (36.7) is not always possible. To see this, note from (36.7) that

$$z\, \underbrace{\Theta J \Theta^T}_{=J}\, z^T = \alpha\, e_0 J e_0^T\, \alpha = \alpha^2$$

i.e., $zJz^T = \alpha^2$. Now since $\alpha^2 > 0$, no matter what the value of $\alpha$ is, this means that the
transformation (36.7) is only possible whenever $zJz^T > 0$. If this is not the case, i.e., if
$zJz^T < 0$, then we should seek instead a $J$-unitary transformation $\Theta$ that transforms $z$ to
the alternative form

$$[\, z(0) \;\; z(1) \;\; \ldots \;\; z(n-1) \,]\, \Theta = \alpha e_{n-1} \qquad (36.8)$$

where $e_{n-1}$ is the last basis vector,

$$e_{n-1} = [\, 0 \;\; \ldots \;\; 0 \;\; 1 \,]$$

In this case, the transformation (36.8) will guarantee $zJz^T = -\alpha^2$, which is consistent
with the fact that $zJz^T < 0$. So let us examine the cases $zJz^T > 0$ and $zJz^T < 0$ separately.


$zJz^T > 0$ (positive value)

To determine an expression for $\Theta$ that meets the requirement (36.7), we can follow the
same geometric argument that we used in the orthogonal case in Chapter 34, except that
we replace $gg^T$ by $gJg^T$ and the inner product $zg^T$ by $zJg^T$. Therefore, the first steps are to
choose $\alpha = \pm\sqrt{zJz^T}$ and $g = z - \alpha e_0$, and then to write

$$\alpha e_0 = z - 2 zJg^T (gJg^T)^{-1} g = z\, \underbrace{\left[ I - 2\, \frac{J g^T g}{g J g^T} \right]}_{\Theta} \qquad (36.9)$$

The indicated matrix $\Theta$ is called a hyperbolic Householder transformation; it is both
$J$-unitary and involutary.

$zJz^T < 0$ (negative value)

In this case, we can also follow the same geometric argument to determine an expression
for $\Theta$ to achieve (36.8). The first steps are now to choose $\alpha = \pm\sqrt{-zJz^T}$ and
$g = z - \alpha e_{n-1}$, and then to write

$$\alpha e_{n-1} = z - 2 zJg^T (gJg^T)^{-1} g = z\, \underbrace{\left[ I - 2\, \frac{J g^T g}{g J g^T} \right]}_{\Theta} \qquad (36.10)$$

The indicated matrix $\Theta$ is also a hyperbolic Householder transformation; it is both
$J$-unitary and involutary. In summary we arrive at the following statement.


Lemma 36.3 (Real hyperbolic Householder) Consider an $n$-dimensional vector
$z$ with real-valued entries. If $zJz^T > 0$, then choose $g = z \pm \sqrt{zJz^T}\, e_0$
and $\Theta$ as in (36.9) to get

$$z\Theta = \mp\sqrt{zJz^T}\; e_0$$

If, on the other hand, $zJz^T < 0$, then choose $g = z \pm \sqrt{-zJz^T}\, e_{n-1}$ and $\Theta$
as in (36.10) to get

$$z\Theta = \mp\sqrt{-zJz^T}\; e_{n-1}$$
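
The construction in the lemma can be sketched for real data as follows. The code assumes the form $\Theta = I - 2 J g^T g/(g J g^T)$ and represents $J$ by the list of its $\pm 1$ diagonal entries. The sign in $g$ is chosen to match the sign of the entry being kept, which is one common way to avoid cancellation; that choice and the function names are assumptions, not prescriptions from the text.

```python
import math


def hyperbolic_householder(z, signs):
    """Return theta (nested lists) with z * theta proportional to e_0 or e_{n-1}.

    z     : list of real entries
    signs : diagonal entries (+1 or -1) of the signature matrix J
    """
    n = len(z)
    zJz = sum(s * v * v for s, v in zip(signs, z))
    if zJz == 0:
        raise ValueError("z J z^T = 0: no hyperbolic reflection exists")
    k = 0 if zJz > 0 else n - 1            # keep the leading or the trailing entry
    if signs[k] * zJz <= 0:
        raise ValueError("signature entry at the kept position must match sign of z J z^T")
    g = list(z)
    g[k] += math.copysign(math.sqrt(abs(zJz)), z[k])
    gJg = sum(s * v * v for s, v in zip(signs, g))
    # theta = I - 2 J g^T g / (g J g^T)
    return [[(1.0 if r == c else 0.0) - 2.0 * signs[r] * g[r] * g[c] / gJg
             for c in range(n)] for r in range(n)]


def row_times_matrix(v, m):
    """Compute the row-vector product v * m."""
    return [sum(v[r] * m[r][c] for r in range(len(v))) for c in range(len(m[0]))]


z, J = [3.0, 1.0, 2.0], [1, 1, -1]          # z J z^T = 9 + 1 - 4 = 6 > 0
print(row_times_matrix(z, hyperbolic_householder(z, J)))   # close to [-sqrt(6), 0, 0]
```

A single reflection annihilates all entries but one, in contrast to the hyperbolic Givens rotation, which annihilates one entry at a time.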


Complex Data 

More generally, consider a $1 \times n$ vector $z$ with possibly complex entries, and assume that
we wish to determine a transformation $\Theta$ that transforms it to either of the forms (36.7) or (36.8)
with a possibly complex-valued $\alpha$, and where $\Theta$ is now required to satisfy

$$\Theta J \Theta^* = \Theta^* J \Theta = J \quad \text{and} \quad \Theta^2 = I$$

for some given signature matrix $J$ and with the same involutary condition. We again need
to distinguish between two cases.

$zJz^* > 0$ (positive value)

Introduce the polar representation of the first entry of $z$, namely, let $z(0) = |a| e^{j\phi_a}$. Then
choose $g = z \pm \sqrt{zJz^*}\, e^{j\phi_a} e_0$ and $\Theta$ as

$$\Theta = I - 2\, \frac{J g^* g}{g J g^*} \qquad (36.11)$$

This choice gives

$$z\Theta = \mp e^{j\phi_a} \sqrt{zJz^*}\; e_0$$

$zJz^* < 0$ (negative value)

Introduce the polar representation of the last entry of $z$, namely, let $z(n-1) = |b| e^{j\phi_b}$.
Then choose $g = z \pm \sqrt{-zJz^*}\, e^{j\phi_b} e_{n-1}$ and $\Theta$ as in (36.11). This leads to

$$z\Theta = \mp e^{j\phi_b} \sqrt{-zJz^*}\; e_{n-1}$$

In summary, we arrive at the following conclusion.


Lemma 36.4 (Complex hyperbolic Householder) Consider a $1 \times n$ vector $z$
with possibly complex-valued entries. If $zJz^* > 0$, then choose
$g = z \pm \sqrt{zJz^*}\, e^{j\phi_a} e_0$ and $\Theta$ as in (36.11) to get

$$z\Theta = \mp e^{j\phi_a} \sqrt{zJz^*}\; e_0$$

If, on the other hand, $zJz^* < 0$, then choose $g = z \pm \sqrt{-zJz^*}\, e^{j\phi_b} e_{n-1}$ and
$\Theta$ as in (36.11) to get

$$z\Theta = \mp e^{j\phi_b} \sqrt{-zJz^*}\; e_{n-1}$$

Here $(|a| e^{j\phi_a}, |b| e^{j\phi_b})$ denote the polar representations of the leading and
trailing entries of $z$, respectively.


Remark 36.1 (Improved numerical accuracy) Computations with hyperbolic transformations
in finite precision can face numerical difficulties due to the possible accumulation of roundoff errors.
If $p = q\Theta$, and if $\hat{p}$ denotes the computed vector that results from the evaluation of the product $q\Theta$
in finite precision, then it is known that

$$\|\hat{p} - p\| \le O(\epsilon) \cdot \|q\| \cdot \|\Theta\|$$

where $\|\Theta\|$ denotes the spectral norm of $\Theta$ (i.e., its maximum singular value), $\epsilon$ denotes the machine
precision, and $O(\epsilon)$ denotes a quantity of the order of the machine precision. This result assumes
floating-point arithmetic and can be found, e.g., in Golub and Van Loan (1996). Now since hyperbolic
rotations $\Theta$ can have relatively large norms, we find that, in general, the computed quantity
$\hat{p}$ need not be evaluated accurately enough. Still, there are some careful ways for implementing
hyperbolic transformations (especially hyperbolic Givens transformations) that help ameliorate
numerical problems in finite-precision implementations. Two such methods are described in App. 14.A
of Sayed (2003).

o 
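
To see how large the norm of a hyperbolic rotation can become, the sketch below evaluates the spectral norm of the real $2 \times 2$ rotation $\Theta = \frac{1}{\sqrt{1-\rho^2}}\begin{bmatrix} 1 & -\rho \\ -\rho & 1 \end{bmatrix}$ for several values of $\rho$. For this symmetric matrix the norm works out to $\sqrt{(1+|\rho|)/(1-|\rho|)}$, which grows without bound as $|\rho| \to 1$. The helper names are illustrative, and the $2 \times 2$ singular value formula used is the standard closed form in terms of the squared Frobenius norm and the determinant.

```python
import math


def hyperbolic_rotation(rho):
    """Real 2x2 hyperbolic rotation with parameter rho, |rho| < 1."""
    scale = 1.0 / math.sqrt(1.0 - rho * rho)
    return [[scale, -rho * scale], [-rho * scale, scale]]


def spectral_norm_2x2(m):
    """Largest singular value of a real 2x2 matrix."""
    t = sum(x * x for row in m for x in row)          # trace of m^T m
    d = m[0][0] * m[1][1] - m[0][1] * m[1][0]         # determinant of m
    return math.sqrt((t + math.sqrt(max(t * t - 4.0 * d * d, 0.0))) / 2.0)


for rho in (0.1, 0.6, 0.9, 0.99, 0.999):
    norm = spectral_norm_2x2(hyperbolic_rotation(rho))
    print(rho, norm, math.sqrt((1.0 + rho) / (1.0 - rho)))   # the two columns agree
```

When $|a|$ and $|b|$ are nearly equal, $|\rho|$ is close to one and the error bound of the remark becomes correspondingly loose.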


36.3 HYPERBOLIC BASIS ROTATIONS 


We now extend the result of Lemma 33.1 by replacing the equality $AA^* = BB^*$ by
$AJA^* = BJB^*$, for some signature matrix $J$. We start with the following statement.


Lemma 36.5 (More columns than rows with a full-rank requirement) Consider
two $n \times m$ matrices $A$ and $B$ with $n \le m$ (i.e., the matrices have more columns
than rows). Let $J = (I_p \oplus -I_q)$ be a signature matrix and assume that $AJA^*$
has full rank. If $AJA^* = BJB^*$, then there should exist a $J$-unitary matrix
$\Theta$ that maps $B$ to $A$, i.e., $A = B\Theta$.


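Proof: Let $S^{-1} = AJA^*$. Then $S^{-1}$ is $n \times n$ Hermitian. Moreover, $S^{-1} = BJB^*$. Let
$\mathrm{In}\{S^{-1}\} = \{\alpha, \beta\}$ denote the inertia of $S^{-1}$, with $\alpha + \beta = n$, and introduce the two block triangular
factorizations:¹⁹

$$\begin{bmatrix} S^{-1} & A \\ A^* & J \end{bmatrix} = \begin{bmatrix} I & \\ A^*S & I \end{bmatrix} \begin{bmatrix} S^{-1} & \\ & J - A^*SA \end{bmatrix} \begin{bmatrix} I & SA \\ & I \end{bmatrix} = \begin{bmatrix} I & AJ \\ & I \end{bmatrix} \begin{bmatrix} \underbrace{S^{-1} - AJA^*}_{=0} & \\ & J \end{bmatrix} \begin{bmatrix} I & \\ JA^* & I \end{bmatrix}$$

Using Sylvester’s law of inertia (cf. Lemma B.5), we conclude that the center matrices in the above
factorizations must have the same inertia. In other words, it must hold that

$$\mathrm{In}\{J - A^*SA\} = \mathrm{In}\{J\} - \mathrm{In}\{S^{-1}\} = \{p - \alpha,\; q - \beta,\; n\}$$

where, from the definition of $J$, $p + q = m$. Similarly, $\mathrm{In}\{J - B^*SB\} = \{p - \alpha,\; q - \beta,\; n\}$. We
therefore find that $J - A^*SA$ and $J - B^*SB$ are $m \times m$ matrices with $n$ zero eigenvalues and
$m - n$ nonzero eigenvalues. These matrices can then be factored as $J - A^*SA = XJ_1X^*$ and
$J - B^*SB = YJ_1Y^*$, where $J_1 = (I_{p-\alpha} \oplus -I_{q-\beta})$ and $\{X, Y\}$ are $m \times (m - n)$. Now define
the square matrices

$$\Sigma_1 \triangleq \begin{bmatrix} A \\ X^* \end{bmatrix}, \qquad \Sigma_2 \triangleq \begin{bmatrix} B \\ Y^* \end{bmatrix}$$

Then it follows that

$$\Sigma_1^* (S \oplus J_1) \Sigma_1 = J \quad \text{and} \quad \Sigma_2^* (S \oplus J_1) \Sigma_2 = J \qquad (36.12)$$

so that $\Sigma_1$ and $\Sigma_2$ are invertible. Multiplying the first equality by $J\Sigma_1^*$ from the right we get
$\Sigma_1^* (S \oplus J_1) \Sigma_1 (J\Sigma_1^*) = J(J\Sigma_1^*) = \Sigma_1^*$, so that $\Sigma_1 J \Sigma_1^* = (S^{-1} \oplus J_1)$. Likewise,
$\Sigma_2 J \Sigma_2^* = (S^{-1} \oplus J_1)$ and, consequently, $\Sigma_1 J \Sigma_1^* = \Sigma_2 J \Sigma_2^*$. From (36.12) we have that
$J\Sigma_2^* = \Sigma_2^{-1} (S^{-1} \oplus J_1)$ so that $\Sigma_1 = \Sigma_2 [J\Sigma_2^* (S \oplus J_1) \Sigma_1]$. If we set
$\Theta = [J\Sigma_2^* (S \oplus J_1) \Sigma_1]$, then $\Theta$ is $J$-unitary and, from the equality of the first block row of
$\Sigma_1 = \Sigma_2\Theta$, we get $A = B\Theta$.

¹⁹ This elegant argument was suggested to the author by his late colleague Professor Tiberiu Constantinescu.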

o 


In the above statement, the matrices $A$ and $B$ were either square or had more columns
than rows (since $n \le m$). We can establish a similar result when $n > m$ and $A$ and $B$ have
full ranks. For this purpose, we first note that if $A$ is $n \times m$ and has full rank, with $n > m$,
then its SVD has the form

$$A = U \begin{bmatrix} \Sigma \\ 0 \end{bmatrix} V^*$$

where $\Sigma$ is $m \times m$ and invertible. The pseudo-inverse of $A$ is $A^\dagger = V [\, \Sigma^{-1} \;\; 0 \,]\, U^*$
and it satisfies $A^\dagger A = I_m$. In other words, the matrix $A$ admits a left inverse. A similar
conclusion holds for $B$.


Lemma 36.6 (More rows than columns) Consider two $n \times m$ matrices $A$ and
$B$ with $n > m$ (i.e., the matrices have more rows than columns). Let
$J = (I_p \oplus -I_q)$ be a signature matrix and assume that $A$ and $B$ have full rank.
The equality $AJA^* = BJB^*$ holds if, and only if, there exists an $m \times m$
$J$-unitary matrix $\Theta$ such that $A = B\Theta$.


Proof: The “if” statement is immediate. If $A = B\Theta$, for some $J$-unitary $\Theta$, then clearly $AJA^* = BJB^*$.
For the converse statement, assume $AJA^* = BJB^*$ and define $\Theta = B^\dagger A$. Then this choice
of $\Theta$ is $J$-unitary and it maps $B$ to $A$. Indeed, note that

$$\underbrace{B^\dagger A}_{\Theta}\, J\, \underbrace{A^* B^{\dagger *}}_{\Theta^*} = \underbrace{B^\dagger B}_{I}\, J\, \underbrace{B^* B^{\dagger *}}_{I} = J$$

Moreover, from the relations $AJA^* = BJB^*$, $B^\dagger B = I_m$, and $A^\dagger A = I_m$ we obtain the following
equalities:

$$AJA^* = B \underbrace{(B^\dagger B)}_{I_m} JB^* = BB^\dagger (BJB^*) = BB^\dagger AJA^*$$

which, upon further multiplication from the right by $A^{\dagger *}$, give

$$AJ \underbrace{A^* A^{\dagger *}}_{I_m} = BB^\dagger AJ \underbrace{A^* A^{\dagger *}}_{I_m} = BB^\dagger AJ$$

That is, $AJ = B\Theta J$. But since $J$ is invertible we arrive at $A = B\Theta$, as desired.


Fast Array Algorithm


All least-squares algorithms studied so far, including RLS, inverse QR, QR, and extended
QR algorithms, do not assume any structure in the data. As a result, the computational
complexity of each of these algorithms is $O(M^2)$ operations per iteration, where $M$
is the order of the filter. However, when data structure is present, more efficient
implementations are possible.

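Thus, consider a collection of $(N + 1)$ data $\{d(j), u_j\}$, where the $\{u_j\}$ are $1 \times M$
and the $d(j)$ are scalars. All the aforementioned algorithms are recursive procedures for
determining the solution $w_N$, and the minimum cost $\xi(N)$, of the regularized least-squares
problem:

$$\min_w \left[ \lambda^{N+1} (w - \bar{w})^* \Pi (w - \bar{w}) + \sum_{j=0}^{N} \lambda^{N-j} |d(j) - u_j w|^2 \right] \qquad (37.1)$$

where $w$ is $M \times 1$, $\Pi > 0$ is $M \times M$ and $0 \ll \lambda \le 1$. In particular, RLS evaluates $w_N$
recursively as follows (cf. Alg. 30.2). Start with $w_{-1} = \bar{w}$, $P_{-1} = \Pi^{-1}$, $\xi(-1) = 0$, and
repeat for $i \ge 0$:

$$\begin{aligned}
\gamma(i) &= 1/(1 + \lambda^{-1} u_i P_{i-1} u_i^*) = 1 - u_i P_i u_i^* \\
g_i &= \lambda^{-1} \gamma(i) P_{i-1} u_i^* = P_i u_i^* \\
e(i) &= d(i) - u_i w_{i-1} \\
w_i &= w_{i-1} + g_i e(i) \\
P_i &= \lambda^{-1} P_{i-1} - g_i g_i^* / \gamma(i) \\
r(i) &= d(i) - u_i w_i \\
\xi(i) &= \lambda \xi(i-1) + r(i) e^*(i)
\end{aligned} \qquad (37.2)$$

These equations hold irrespective of any structure in the $\{u_i\}$.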
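
For concreteness, the RLS recursions can be sketched in plain Python for real-valued data. The sketch below assumes the standard exponentially weighted form with $\gamma(i) = 1/(1 + \lambda^{-1} u_i P_{i-1} u_i^*)$, $g_i = \lambda^{-1}\gamma(i) P_{i-1} u_i^*$, and $P_i = \lambda^{-1} P_{i-1} - g_i g_i^*/\gamma(i)$, together with $\Pi = \epsilon I$, $\bar{w} = 0$, and tapped-delay-line regressors that are zero before time 0. The function name `rls` and the parameter `eps` are illustrative choices.

```python
import random


def rls(inputs, desired, M, lam=0.995, eps=0.01):
    """Exponentially weighted RLS for real data; returns (w, P, xi)."""
    w = [0.0] * M                                            # w_{-1} = 0
    P = [[(1.0 / eps if r == c else 0.0) for c in range(M)]  # P_{-1} = Pi^{-1}
         for r in range(M)]
    xi = 0.0                                                 # xi(-1) = 0
    for i, d in enumerate(desired):
        # regressor u_i = [u(i), u(i-1), ..., u(i-M+1)]
        u = [inputs[i - k] if i - k >= 0 else 0.0 for k in range(M)]
        Pu = [sum(P[r][c] * u[c] for c in range(M)) for r in range(M)]
        gamma = 1.0 / (1.0 + sum(u[r] * Pu[r] for r in range(M)) / lam)
        g = [gamma * x / lam for x in Pu]                    # gain vector
        e = d - sum(u[k] * w[k] for k in range(M))           # a priori error
        w = [w[k] + g[k] * e for k in range(M)]
        # the P update costs O(M^2) operations per iteration
        P = [[P[r][c] / lam - g[r] * g[c] / gamma for c in range(M)]
             for r in range(M)]
        r_post = d - sum(u[k] * w[k] for k in range(M))      # a posteriori error
        xi = lam * xi + r_post * e
    return w, P, xi


rng = random.Random(1)
w_true = [1.0, -2.0, 0.5]                                    # hypothetical channel
x = [rng.gauss(0.0, 1.0) for _ in range(200)]
d = [sum(w_true[k] * (x[i - k] if i - k >= 0 else 0.0) for k in range(3))
     for i in range(200)]
w_hat, _, _ = rls(x, d, 3)
print(w_hat)                                                 # close to w_true
```

The nested list comprehension that updates `P` is the $O(M^2)$ step that the fast algorithms of this part avoid.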

Now, it is often the case that the regressors $\{u_i\}$ exhibit some form of structure. In the
chapters in this part, we study the case in which the $\{u_i\}$ arise as regressors of a
tapped-delay-line implementation, as shown in Fig. 37.1. That is, we assume that the entries of $u_i$
are formed from time-delayed samples of an input sequence $\{u(i)\}$, say,

$$u_i = [\, u(i) \;\; u(i-1) \;\; \ldots \;\; u(i-M+2) \;\; u(i-M+1) \,] \qquad (37.3)$$

In this case, two successive regressors will share most of their entries since, for example,
$u_{i-1}$ will be given by

$$u_{i-1} = [\, u(i-1) \;\; u(i-2) \;\; \ldots \;\; u(i-M+1) \;\; u(i-M) \,] \qquad (37.4)$$

FIGURE 37.1 A tapped-delay-line structure resulting in regressors with shift-structure.

Comparing expressions (37.3) and (37.4) for $u_i$ and $u_{i-1}$ we see that $u_i$ is obtained from
$u_{i-1}$ by shifting the entries of the latter by one position to the right and introducing a new
entry, $u(i)$, at the left. One way to capture this relation is to note that the following equality
holds:

$$[\, u(i) \;\; u_{i-1} \,] = [\, u_i \;\; u(i-M) \,] \qquad (37.5)$$

That is, if we extend $u_{i-1}$ by one entry to the left and $u_i$ by one entry to the right, then the
extended vectors will coincide.

The shift structure in the regressors can be exploited to great effect in order to devise
efficient recursive least-squares solutions. By efficient we mean algorithms whose computational
complexity is $O(M)$ operations per iteration, as opposed to $O(M^2)$; i.e., they
are an order of magnitude more efficient than the slower implementations that we studied
before (RLS, inverse QR, QR, extended QR). There are several classes of efficient RLS algorithms
that can be derived by exploiting data structure. In this part we study fixed-order
algorithms, while in Part X (Lattice Filters) we study order-recursive algorithms.

We start our discussions by deriving an efficient RLS algorithm in array form; we do
so since the derivation of the array form is more immediate than other efficient RLS algorithms.
Later, in Secs. 39.1–39.3 we derive efficient algorithms in explicit forms.


37.1 TIME-UPDATE OF THE GAIN VECTOR 


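Consider the RLS update (37.2) for $w_i$, namely,

$$w_i = w_{i-1} + g_i [d(i) - u_i w_{i-1}] \qquad (37.6)$$

and note that RLS requires the gain vector $g_i$ in order to compute $w_i$. In turn, the evaluation
of $g_i$ requires the matrix $P_{i-1}$ (or $P_i$), and updating $P_{i-1}$ (or $P_i$) requires $O(M^2)$
operations per iteration. This update step is the main computational bottleneck in the RLS
algorithm. However, when the $\{u_i\}$ have shift structure, as indicated by (37.5), the gain
vector $g_i$ can be evaluated more immediately without requiring evaluation of $P_{i-1}$ (or $P_i$).
This will be achieved by developing a time-update for $g_i$ itself, i.e., by showing how to
compute $g_i$ from $g_{i-1}$ directly.

From the definition of $g_i$ in (37.2) we have that

$$g_i \gamma^{-1}(i) = \lambda^{-1} P_{i-1} u_i^* \quad \text{and} \quad g_{i-1} \gamma^{-1}(i-1) = \lambda^{-1} P_{i-2} u_{i-1}^* \qquad (37.7)$$

Now, because of the shift structure in the regressors $\{u_i, u_{i-1}\}$, we can express the
extended vector $\mathrm{col}\{P_{i-1} u_i^*, 0\}$ in the equivalent forms

$$\begin{bmatrix} P_{i-1} u_i^* \\ 0 \end{bmatrix} = \begin{bmatrix} P_{i-1} & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} u_i^* \\ u^*(i-M) \end{bmatrix} = \begin{bmatrix} P_{i-1} & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} u^*(i) \\ u_{i-1}^* \end{bmatrix} \qquad (37.8)$$

where in the second equality we used (37.5). Likewise, we can write

$$\begin{bmatrix} 0 \\ P_{i-2} u_{i-1}^* \end{bmatrix} = \begin{bmatrix} 0 & 0 \\ 0 & P_{i-2} \end{bmatrix} \begin{bmatrix} u^*(i) \\ u_{i-1}^* \end{bmatrix} \qquad (37.9)$$

Consequently, subtracting (37.8) and (37.9), we get

$$\lambda^{-1} \begin{bmatrix} P_{i-1} u_i^* \\ 0 \end{bmatrix} - \lambda^{-1} \begin{bmatrix} 0 \\ P_{i-2} u_{i-1}^* \end{bmatrix} = \lambda^{-1} \left( \begin{bmatrix} P_{i-1} & 0 \\ 0 & 0 \end{bmatrix} - \begin{bmatrix} 0 & 0 \\ 0 & P_{i-2} \end{bmatrix} \right) \begin{bmatrix} u^*(i) \\ u_{i-1}^* \end{bmatrix}$$

Let $\delta P_{i-1}$ denote the $(M+1) \times (M+1)$ difference

$$\delta P_{i-1} \triangleq \begin{bmatrix} P_{i-1} & 0 \\ 0 & 0 \end{bmatrix} - \begin{bmatrix} 0 & 0 \\ 0 & P_{i-2} \end{bmatrix}$$

Using the defining relations (37.7) for $\{g_i, g_{i-1}\}$, we arrive at

$$\begin{bmatrix} g_i \gamma^{-1}(i) \\ 0 \end{bmatrix} = \begin{bmatrix} 0 \\ g_{i-1} \gamma^{-1}(i-1) \end{bmatrix} + \lambda^{-1}\, \delta P_{i-1} \begin{bmatrix} u^*(i) \\ u_{i-1}^* \end{bmatrix} \qquad (37.10)$$

This is a significant result. It shows that in order to time-update the gain vector from
$g_{i-1} \gamma^{-1}(i-1)$ to $g_i \gamma^{-1}(i)$, it is only necessary to know what the difference $\delta P_{i-1}$ is; it is
not necessary to know the value of the individual matrices $\{P_{i-1}, P_{i-2}\}$ themselves. In this
way, it suffices to know how to update $\delta P_{i-1}$ to $\delta P_i$ in order to carry out the updates:

$$g_{i-1} \gamma^{-1}(i-1) \;\xrightarrow{\;\delta P_{i-1}\;}\; g_i \gamma^{-1}(i) \;\xrightarrow{\;\delta P_i\;}\; g_{i+1} \gamma^{-1}(i+1) \;\longrightarrow\; \ldots$$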

37.2 TIME-UPDATE OF THE CONVERSION FACTOR 


Although knowledge of 5P;_; is enough to update g; ^y^! (i — 1) to gy7} (i), we still 
need to recover g; itself. In other words, we still need to know how to remove the scaling by 
*y 1 (1) from g;Y ^? (i). If we were to evaluate y(i) as suggested by the RLS implementation 
(37.2), in terms of P;..; (or P;), then we would be back to square one in terms of excessive 
computational complexity. However, it turns out that the conversion factor (i) can also 
be time-updated in a manner that only requires knowledge of óP;..,. 

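To see this, we use the expression for $\gamma(i)$ in (37.2) to write

$$\gamma^{-1}(i) = 1 + \lambda^{-1}u_iP_{i-1}u_i^*, \qquad \gamma^{-1}(i-1) = 1 + \lambda^{-1}u_{i-1}P_{i-2}u_{i-1}^*$$

so that, upon subtraction, we get

$$\gamma^{-1}(i) - \gamma^{-1}(i-1) = \lambda^{-1}\left[u_iP_{i-1}u_i^* - u_{i-1}P_{i-2}u_{i-1}^*\right]$$

Again, using the partitioning (37.5), we can express the difference

$$\nabla \triangleq u_iP_{i-1}u_i^* - u_{i-1}P_{i-2}u_{i-1}^*$$

as

$$\nabla = \begin{bmatrix} u(i) & u_{i-1} \end{bmatrix}\left(\begin{bmatrix} P_{i-1} & 0 \\ 0 & 0 \end{bmatrix} - \begin{bmatrix} 0 & 0 \\ 0 & P_{i-2} \end{bmatrix}\right)\begin{bmatrix} u^*(i) \\ u_{i-1}^* \end{bmatrix}$$

and, hence,

$$\gamma^{-1}(i) = \gamma^{-1}(i-1) + \lambda^{-1}\begin{bmatrix} u(i) & u_{i-1} \end{bmatrix}\delta P_{i-1}\begin{bmatrix} u^*(i) \\ u_{i-1}^* \end{bmatrix} \tag{37.11}$$

which shows, as desired, that knowledge of $\delta P_{i-1}$ is also sufficient for the time-update of
the conversion factor.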
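The two time-updates (37.10) and (37.11) can be checked numerically. The following is a minimal sketch, assuming real-valued data, a white Gaussian input, an identity matrix for $P_{-1}$, and illustrative values for $\lambda$, $M$, and the run length; the helper names (`rls_step`, `embed`, `regressor`) are hypothetical and not from the text. It runs the standard RLS recursion, forms $\delta P_{i-1}$ explicitly, and measures how well both relations hold.

```python
import random

LAM, M, N = 0.95, 3, 20  # illustrative forgetting factor, order, and run length

def rls_step(P_prev, u):
    """One RLS step for real data: returns gamma(i), g_i, and P_i."""
    Pu = [sum(p * v for p, v in zip(row, u)) for row in P_prev]
    gamma = 1.0 / (1.0 + sum(a * b for a, b in zip(u, Pu)) / LAM)
    g = [v * gamma / LAM for v in Pu]
    P = [[P_prev[r][c] / LAM - g[r] * g[c] / gamma for c in range(M)]
         for r in range(M)]
    return gamma, g, P

def embed(P, offset):
    """Copy the M x M matrix P into an (M+1) x (M+1) zero matrix at (offset, offset)."""
    E = [[0.0] * (M + 1) for _ in range(M + 1)]
    for r in range(M):
        for c in range(M):
            E[r + offset][c + offset] = P[r][c]
    return E

random.seed(0)
x = [random.gauss(0.0, 1.0) for _ in range(N)]

def regressor(i):
    """Tapped-delay-line regressor u_i = [u(i), u(i-1), ..., u(i-M+1)]."""
    return [x[i - k] if i - k >= 0 else 0.0 for k in range(M)]

P_hist = [[[1.0 if r == c else 0.0 for c in range(M)] for r in range(M)]]  # P_{-1}
scaled_gain, inv_gamma, max_err = [], [], 0.0
for i in range(N):
    gamma, g, P = rls_step(P_hist[-1], regressor(i))
    scaled_gain.append([gk / gamma for gk in g])        # g_i / gamma(i)
    inv_gamma.append(1.0 / gamma)
    if i >= 1:
        A, B = embed(P_hist[-1], 0), embed(P_hist[-2], 1)  # P_{i-1} and P_{i-2}
        dP = [[A[r][c] - B[r][c] for c in range(M + 1)] for r in range(M + 1)]
        ext = [x[i]] + regressor(i - 1)                    # [u(i)  u_{i-1}]
        corr = [sum(dP[r][c] * ext[c] for c in range(M + 1)) / LAM
                for r in range(M + 1)]
        lhs = scaled_gain[i] + [0.0]
        rhs = [a + b for a, b in zip([0.0] + scaled_gain[i - 1], corr)]
        quad = sum(e * c for e, c in zip(ext, corr))
        max_err = max(max_err,
                      max(abs(a - b) for a, b in zip(lhs, rhs)),
                      abs(inv_gamma[i] - inv_gamma[i - 1] - quad))
    P_hist.append(P)

print("largest mismatch in (37.10) and (37.11):", max_err)
```

The check still forms the full matrices $P_{i-1}$ and $P_{i-2}$, so it offers no computational saving; it only illustrates that the difference $\delta P_{i-1}$ carries all the information needed by both updates.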


37.3 INITIAL CONDITIONS 


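In order to complete the argument, we need to show how to compute the factor $\delta P_{i-1}$
that appears in (37.10) and (37.11) in an efficient manner. This step requires that the
regularization matrix $\Pi$ be chosen in an appropriate manner, as we now explain.

Assume $u(i) = 0$ for $i < 0$ so that $u_{-1} = 0$ (i.e., the initial state of the filter is zero).
Then $g_{-1} = 0$, $\gamma(-1) = 1$, $P_{-1} = \Pi^{-1}$, and $P_{-2} = \lambda P_{-1} = \lambda\Pi^{-1}$, so that the difference
$\delta P_i$ at time $-1$ is given by

$$\delta P_{-1} = \begin{bmatrix} P_{-1} & 0 \\ 0 & 0 \end{bmatrix} - \begin{bmatrix} 0 & 0 \\ 0 & P_{-2} \end{bmatrix} = \begin{bmatrix} \Pi^{-1} & 0 \\ 0 & 0 \end{bmatrix} - \lambda\begin{bmatrix} 0 & 0 \\ 0 & \Pi^{-1} \end{bmatrix} \tag{37.12}$$

The value of this difference depends on our choice of $\Pi$. So assume that we choose $\Pi^{-1}$
as the diagonal matrix

$$\Pi^{-1} = \eta \cdot \mathrm{diag}\{\lambda^2, \lambda^3, \ldots, \lambda^{M+1}\} \tag{37.13}$$

where $\eta$ is a positive scalar (usually large) and $\lambda$ is the forgetting factor. Then $\delta P_{-1}$
becomes

$$\delta P_{-1} = \eta\lambda^2 \cdot \mathrm{diag}\{1, 0, \ldots, 0, -\lambda^M\}$$

That is, $\delta P_{-1}$ reduces to a rank-two matrix with one positive eigenvalue and one negative
eigenvalue. Actually, it is not difficult to verify that $\delta P_{-1}$ can be factored as

$$\delta P_{-1} = \lambda\, L_{-1}S_{-1}L_{-1}^*$$

where $L_{-1}$ is $(M+1) \times 2$ and $S_{-1}$ is $2 \times 2$ and given by

$$L_{-1} = \sqrt{\eta\lambda}\begin{bmatrix} 1 & 0 \\ 0 & 0 \\ \vdots & \vdots \\ 0 & 0 \\ 0 & \lambda^{M/2} \end{bmatrix}, \qquad S_{-1} = \begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix} \tag{37.14}$$

The signature matrix, $S_{-1}$, indicates that the difference $\delta P_{-1}$ has one positive eigenvalue
and one negative eigenvalue; all other eigenvalues are zero since the rank is 2.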
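The initial difference can be reproduced with a few lines of arithmetic. The sketch below assumes illustrative values for $\eta$, $\lambda$, and $M$, and one candidate factor $L_{-1}$ with only two nonzero entries (the factor is not unique); the variable names are hypothetical. It builds the diagonal of $\delta P_{-1}$ from the choice (37.13) and compares it with $\eta\lambda^2\,\mathrm{diag}\{1,0,\ldots,0,-\lambda^M\}$ and with $\lambda L_{-1}S_{-1}L_{-1}^*$.

```python
import math

ETA, LAM, M = 10.0, 0.9, 3  # illustrative values

pi_inv = [ETA * LAM ** (k + 2) for k in range(M)]  # diagonal of Pi^{-1} in (37.13)
top = pi_inv + [0.0]                               # diagonal of [Pi^{-1} 0; 0 0]
bottom = [0.0] + [LAM * p for p in pi_inv]         # diagonal of lam * [0 0; 0 Pi^{-1}]
dP_diag = [t - b for t, b in zip(top, bottom)]     # diagonal of dP_{-1}

# Candidate factor: L is (M+1) x 2 with two nonzero entries, S = diag(1, -1).
scale = math.sqrt(ETA * LAM)
L = [[0.0, 0.0] for _ in range(M + 1)]
L[0][0] = scale
L[M][1] = scale * LAM ** (M / 2)
S = [1.0, -1.0]
recon = [[LAM * sum(L[r][k] * S[k] * L[c][k] for k in range(2))
          for c in range(M + 1)] for r in range(M + 1)]

expected = [ETA * LAM ** 2 * v for v in [1.0] + [0.0] * (M - 1) + [-LAM ** M]]
print("diagonal of dP_{-1}:", dP_diag)
```

Only the first and last diagonal entries survive the subtraction, which is why two columns are enough for the factor.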

This argument shows that the choice (37.13) for $\Pi$ leads to a rank 2 difference matrix
$\delta P_{-1}$ with signature $(1 \oplus -1)$. A striking property that is established in the next subsection
is that the rank and inertia properties of $\delta P_i$ remain invariant over time; the rank never goes
up; it is always 2, with the same signature matrix $S_{-1}$, except in rare degenerate situations
where the rank can only drop below 2. More specifically, we argue ahead that once the
low-rank property holds at a certain time instant $i - 1$, say,

$$\begin{bmatrix} P_{i-1} & 0 \\ 0 & 0 \end{bmatrix} - \begin{bmatrix} 0 & 0 \\ 0 & P_{i-2} \end{bmatrix} = \lambda\, L_{i-1}S_{i-1}L_{i-1}^* \tag{37.15}$$

for some $\{L_{i-1}, S_{i-1}\}$, then three important facts hold:

1. The low-rank property will be valid at time $i$ as well, say,

$$\begin{bmatrix} P_i & 0 \\ 0 & 0 \end{bmatrix} - \begin{bmatrix} 0 & 0 \\ 0 & P_{i-1} \end{bmatrix} = \lambda\, L_iS_iL_i^* \quad \text{for some } \{L_i, S_i\}.$$

2. There will exist an array algorithm that updates $L_{i-1}$ to $L_i$, and this same algorithm
will also provide the gain vector $g_i$ that is needed in (37.6).

3. The signature matrices $\{S_{i-1}, S_i\}$ will coincide.


37.4 ARRAY ALGORITHM 


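The desired array algorithm for updating $\{L_{i-1}, g_{i-1}, \gamma(i-1)\}$ to $\{L_i, g_i, \gamma(i)\}$ can be
motivated and derived in much the same manner as we did in Sec. 33.4.
Thus, starting from (37.10) and (37.11), with $\delta P_{i-1}$ replaced by its factorization (37.15), namely,

$$\begin{bmatrix} g_i\gamma^{-1}(i) \\ 0 \end{bmatrix} = \begin{bmatrix} 0 \\ g_{i-1}\gamma^{-1}(i-1) \end{bmatrix} + L_{i-1}S_{i-1}L_{i-1}^*\begin{bmatrix} u^*(i) \\ u_{i-1}^* \end{bmatrix}$$

$$\gamma^{-1}(i) = \gamma^{-1}(i-1) + \begin{bmatrix} u(i) & u_{i-1} \end{bmatrix}L_{i-1}S_{i-1}L_{i-1}^*\begin{bmatrix} u^*(i) \\ u_{i-1}^* \end{bmatrix}$$

we note that these expressions are of the form

$$CC^* = AA^* + BSB^*, \qquad FC^* = DA^* + ESB^* \tag{37.16}$$

with the following identifications

$$C \leftarrow \gamma^{-1/2}(i), \qquad A \leftarrow \gamma^{-1/2}(i-1), \qquad B \leftarrow \begin{bmatrix} u(i) & u_{i-1} \end{bmatrix}L_{i-1}$$

$$F \leftarrow \begin{bmatrix} g_i\gamma^{-1/2}(i) \\ 0 \end{bmatrix}, \qquad D \leftarrow \begin{bmatrix} 0 \\ g_{i-1}\gamma^{-1/2}(i-1) \end{bmatrix}, \qquad E \leftarrow L_{i-1}, \qquad S \leftarrow S_{i-1}$$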


Generic Array Algorithm 
The forms (37.16) play a role similar to the "norm-preserving" and "inner-product pre- 
serving" equalities (33.19) and (33.20), which we used in Sec. 33.4 to motivate and derive 
array algorithms. The main distinction is the presence of the signature matrix S in (37.16). 
Still, the same arguments that we employed in Sec. 33.4 could be repeated here with the 
only issue being that we now need to deal with hyperbolic transformations as opposed to 
unitary transformations. 

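More specifically, as in (33.21), given general equalities of the form (37.16) where it
is desired to evaluate $\{C, F\}$ from knowledge of $\{A, B, D, E, S\}$, we would form the
pre-array

$$\mathcal{A} = \begin{bmatrix} A & B \\ D & E \end{bmatrix} \quad \text{and reduce it to the form} \quad \begin{bmatrix} C & 0 \\ F & Z \end{bmatrix} \tag{37.17}$$

by annihilating $B$ by means of a transformation $\Theta$ that is now required to be $(I \oplus S)$-unitary,
i.e., it should satisfy

$$\Theta\begin{bmatrix} I & \\ & S \end{bmatrix}\Theta^* = \begin{bmatrix} I & \\ & S \end{bmatrix}$$

The question of course is whether such a $\Theta$ exists. While in Sec. 33.4 we started from
equality (33.19) and appealed to (33.9) to justify the existence of a $\Theta$ that achieves (33.21),
this same argument cannot be applied to the equality

$$CC^* = AA^* + BSB^* \tag{37.18}$$

due to the presence of the signature matrix $S$. However, Lemma 36.5 already extends the
result (33.9) to handle such more general cases with signature matrices. Specifically, by
writing (37.18) as

$$\begin{bmatrix} C & 0 \end{bmatrix}\begin{bmatrix} I & \\ & S \end{bmatrix}\begin{bmatrix} C & 0 \end{bmatrix}^* = \begin{bmatrix} A & B \end{bmatrix}\begin{bmatrix} I & \\ & S \end{bmatrix}\begin{bmatrix} A & B \end{bmatrix}^*$$

we are able to appeal to Lemma 36.5 to conclude that an $(I \oplus S)$-unitary $\Theta$ exists that
maps $[A \;\; B]$ to $[C \;\; 0]$. Therefore, in a manner similar to the explanations that led to
(33.21), in order to determine $\{C, F\}$ we would first use one such $(I \oplus S)$-unitary matrix
$\Theta$ to perform the transformation

$$\begin{bmatrix} A & B \\ D & E \end{bmatrix}\Theta = \begin{bmatrix} X & 0 \\ Y & Z \end{bmatrix}$$

and then "square" and compare entries on both sides of the equality:

$$\begin{bmatrix} A & B \\ D & E \end{bmatrix}\underbrace{\Theta\begin{bmatrix} I & \\ & S \end{bmatrix}\Theta^*}_{(I \oplus S)}\begin{bmatrix} A & B \\ D & E \end{bmatrix}^* = \begin{bmatrix} X & 0 \\ Y & Z \end{bmatrix}\begin{bmatrix} I & \\ & S \end{bmatrix}\begin{bmatrix} X & 0 \\ Y & Z \end{bmatrix}^*$$

in order to identify $X$ as $C$ and $Y$ as $F$.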
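A scalar instance makes the generic construction concrete. The sketch below assumes real scalar blocks with $S = -1$ and illustrative values for $A, B, D, E$; the helper names are hypothetical. It builds a $2 \times 2$ hyperbolic rotation, which is $(1 \oplus -1)$-unitary, applies it to the pre-array rows, and compares the resulting $X$ and $Y$ with the values of $C$ and $F$ obtained directly from (37.16).

```python
import math

def hyperbolic_rotation(a, b):
    """Return the 2 x 2 matrix Theta with Theta * diag(1,-1) * Theta^T = diag(1,-1)
    that maps the row [a, b] to [sqrt(a^2 - b^2), 0]; assumes a > 0 and |a| > |b|."""
    rho = b / a
    k = 1.0 / math.sqrt(1.0 - rho * rho)
    return [[k, -k * rho], [-k * rho, k]]

def row_times(row, T):
    """Multiply a 1 x 2 row by a 2 x 2 matrix."""
    return [row[0] * T[0][0] + row[1] * T[1][0],
            row[0] * T[0][1] + row[1] * T[1][1]]

A, B, D, E = 5.0, 3.0, 2.0, 1.0      # illustrative scalar blocks, with S = -1
Theta = hyperbolic_rotation(A, B)
X, zero = row_times([A, B], Theta)   # top row of the post-array
Y, Z = row_times([D, E], Theta)      # bottom row of the post-array

C = math.sqrt(A * A - B * B)         # from CC* = AA* + BSB* with S = -1
F = (D * A - E * B) / C              # from FC* = DA* + ESB* with S = -1
print(X, zero, Y, Z)
```

The rotation exists only when $|A| > |B|$, which mirrors the requirement that the right-hand side of (37.18) be positive.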


Fast RLS Array Algorithm 
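Applying this construction to the RLS case (37.16), we would first form the pre-array

$$\mathcal{A} = \begin{bmatrix} \gamma^{-1/2}(i-1) & \begin{bmatrix} u(i) & u_{i-1} \end{bmatrix}L_{i-1} \\ \begin{bmatrix} 0 \\ g_{i-1}\gamma^{-1/2}(i-1) \end{bmatrix} & L_{i-1} \end{bmatrix}$$

For example, when $M = 3$, this array has the form (recall that $L_i$ is $(M+1) \times 2$):

$$\mathcal{A} = \begin{bmatrix} \times & \times & \times \\ 0 & \times & \times \\ \times & \times & \times \\ \times & \times & \times \\ \times & \times & \times \end{bmatrix}$$

Then we would define the signature matrix $J = (1 \oplus S_{i-1})$, and choose a $J$-unitary matrix
$\Theta_i$ (i.e., $\Theta_iJ\Theta_i^* = J$) such that it transforms $\mathcal{A}$ to the form

$$\mathcal{B} = \begin{bmatrix} x & 0 \\ y & Z \end{bmatrix} \tag{37.19}$$

for some quantities $\{x, y, Z\}$ to be determined, with $x$ a positive scalar, $y$ a column vector,
and $Z$ a two column matrix. Again, for $M = 3$, the post-array will be of the form:

$$\mathcal{B} = \begin{bmatrix} \times & 0 & 0 \\ \times & \times & \times \\ \times & \times & \times \\ \times & \times & \times \\ \times & \times & \times \end{bmatrix}$$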
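The matrix $\Theta_i$ can be implemented in many ways (as described in Chapter 36). For example,
we could employ a circular (Givens) rotation that pivots with the left-most entry of the
first row and annihilates its second entry. We could then employ a hyperbolic rotation that
pivots again with the left-most entry and annihilates the last entry of the same row:

$$\begin{bmatrix} \times & \times & \times \\ 0 & \times & \times \\ \times & \times & \times \\ \times & \times & \times \\ \times & \times & \times \end{bmatrix} \xrightarrow{\ \text{Givens}\ } \begin{bmatrix} \times' & 0 & \times \\ \times' & \times' & \times \\ \times' & \times' & \times \\ \times' & \times' & \times \\ \times' & \times' & \times \end{bmatrix} \xrightarrow{\ \text{hyperbolic}\ } \begin{bmatrix} \times'' & 0 & 0 \\ \times'' & \times' & \times'' \\ \times'' & \times' & \times'' \\ \times'' & \times' & \times'' \\ \times'' & \times' & \times'' \end{bmatrix}$$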


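Now, in order to identify the entries $\{x, y, Z\}$ in the post-array (37.19), i.e., in the equality
below

$$\underbrace{\begin{bmatrix} \gamma^{-1/2}(i-1) & \begin{bmatrix} u(i) & u_{i-1} \end{bmatrix}L_{i-1} \\ \begin{bmatrix} 0 \\ g_{i-1}\gamma^{-1/2}(i-1) \end{bmatrix} & L_{i-1} \end{bmatrix}}_{\mathcal{A}}\Theta_i = \underbrace{\begin{bmatrix} x & 0 \\ y & Z \end{bmatrix}}_{\mathcal{B}}$$

we simply compare entries on both sides of the equality $\mathcal{A}\Theta_iJ\Theta_i^*\mathcal{A}^* = \mathcal{B}J\mathcal{B}^*$ to find that

$$\begin{aligned}
|x|^2 &= \gamma^{-1}(i-1) + \begin{bmatrix} u(i) & u_{i-1} \end{bmatrix}L_{i-1}S_{i-1}L_{i-1}^*\begin{bmatrix} u^*(i) \\ u_{i-1}^* \end{bmatrix} \\
yx^* &= \begin{bmatrix} 0 \\ g_{i-1}\gamma^{-1}(i-1) \end{bmatrix} + L_{i-1}S_{i-1}L_{i-1}^*\begin{bmatrix} u^*(i) \\ u_{i-1}^* \end{bmatrix} \\
yy^* + ZS_{i-1}Z^* &= \begin{bmatrix} 0 \\ g_{i-1}\gamma^{-1/2}(i-1) \end{bmatrix}\begin{bmatrix} 0 \\ g_{i-1}\gamma^{-1/2}(i-1) \end{bmatrix}^* + L_{i-1}S_{i-1}L_{i-1}^*
\end{aligned} \tag{37.20}$$

The right-hand side of the first equality coincides with that of (37.11) so that we can identify
$x$ as $x = \gamma^{-1/2}(i)$. Similarly, the right-hand side of the second equality in (37.20)
coincides with that of (37.10), so that we can identify $y$ as

$$y = \begin{bmatrix} g_i\gamma^{-1/2}(i) \\ 0 \end{bmatrix}$$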


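Finally, the last equality in (37.20) leads to

$$\begin{aligned}
ZS_{i-1}Z^* &= \begin{bmatrix} 0 \\ g_{i-1}\gamma^{-1/2}(i-1) \end{bmatrix}\begin{bmatrix} 0 \\ g_{i-1}\gamma^{-1/2}(i-1) \end{bmatrix}^* - \begin{bmatrix} g_i\gamma^{-1/2}(i) \\ 0 \end{bmatrix}\begin{bmatrix} g_i\gamma^{-1/2}(i) \\ 0 \end{bmatrix}^* + L_{i-1}S_{i-1}L_{i-1}^* \\
&= \begin{bmatrix} 0 & 0 \\ 0 & g_{i-1}g_{i-1}^*/\gamma(i-1) \end{bmatrix} - \begin{bmatrix} g_ig_i^*/\gamma(i) & 0 \\ 0 & 0 \end{bmatrix} + \begin{bmatrix} \lambda^{-1}P_{i-1} & 0 \\ 0 & 0 \end{bmatrix} - \begin{bmatrix} 0 & 0 \\ 0 & \lambda^{-1}P_{i-2} \end{bmatrix} \\
&= \begin{bmatrix} P_i & 0 \\ 0 & 0 \end{bmatrix} - \begin{bmatrix} 0 & 0 \\ 0 & P_{i-1} \end{bmatrix}
\end{aligned}$$

where the last step uses the RLS update $P_i = \lambda^{-1}P_{i-1} - g_ig_i^*/\gamma(i)$ at times $i$ and $i-1$.
The difference $\delta P_i$ is, by definition, $\lambda L_iS_iL_i^*$, so that

$$ZS_{i-1}Z^* = \lambda\, L_iS_iL_i^*$$

This result shows that the difference $\delta P_i$ has the same signature matrix, $S_{i-1}$, as $\delta P_{i-1}$
and, consequently, $S_i$ remains invariant for all $i$. For this reason, we shall drop the time
subscript $i$ from $S_i$ and write $S$ instead. Moreover, we can identify $Z$ as $Z = \sqrt{\lambda}\,L_i$. In
summary, we arrive at the following statement.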


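Algorithm 37.1 (Fast RLS array algorithm) Consider data $\{u_j, d(j)\}_{j=0}^{i}$,
where the $u_j$ are $1 \times M$ and the $d(j)$ are scalars. Consider also an $M \times 1$
vector $\bar{w}$, a scalar $0 < \lambda \le 1$, and an $M \times M$ positive-definite matrix $\Pi^{-1}$
of the form

$$\Pi^{-1} = \eta \cdot \mathrm{diag}\{\lambda^2, \lambda^3, \ldots, \lambda^{M+1}\}, \qquad \eta > 0$$

When the $\{u_j\}$ correspond to regressors of a tapped-delay-line implementation,
the solution $w_i$ of the least-squares problem (37.1) can be recursively
computed as follows. Start with $w_{-1} = \bar{w}$, $\gamma^{-1/2}(-1) = 1$, $g_{-1} = 0$, $L_{-1}$
and $S$ as in (37.14), $J = (1 \oplus S)$, and repeat for $i \ge 0$:

1. Find a $J$-unitary matrix $\Theta_i$ that annihilates the last two entries in the
top row of the post-array below and generates a positive leading entry.
Then the entries of the post-array will correspond to

$$\begin{bmatrix} \gamma^{-1/2}(i-1) & \begin{bmatrix} u(i) & u_{i-1} \end{bmatrix}L_{i-1} \\ \begin{bmatrix} 0 \\ g_{i-1}\gamma^{-1/2}(i-1) \end{bmatrix} & L_{i-1} \end{bmatrix}\Theta_i = \begin{bmatrix} \gamma^{-1/2}(i) & \begin{bmatrix} 0 & 0 \end{bmatrix} \\ \begin{bmatrix} g_i\gamma^{-1/2}(i) \\ 0 \end{bmatrix} & \sqrt{\lambda}\,L_i \end{bmatrix}$$

2. Update $w_i = w_{i-1} + [g_i\gamma^{-1/2}(i)]\,[\gamma^{-1/2}(i)]^{-1}\,[d(i) - u_iw_{i-1}]$, where
the quantities $\{g_i\gamma^{-1/2}(i), \gamma^{-1/2}(i)\}$ are read from the post-array.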
Observe that this array algorithm computes the gain vector $g_i$ without evaluating the
M x M matrix P;. Instead, the low-rank factor L;, which is (M + 1) x 2, is propagated, 
resulting in a lower computational complexity. Later, in Table 39.1, we shall compare the 
computational requirements of several fast fixed-order variants. 
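The statement of Alg. 37.1 can be exercised against the standard RLS recursion. The following is a minimal sketch, assuming real data, $\bar{w} = 0$, a Givens rotation followed by a hyperbolic rotation as one possible realization of $\Theta_i$, and illustrative values for $M$, $\lambda$, $\eta$, and a hypothetical test model `w_true`; the function names are not from the text. Both routines start from the regularization (37.13).

```python
import math
import random

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def rls(x, d, M, lam, eta):
    """Reference RLS with P_{-1} = eta * diag(lam^2, ..., lam^(M+1))."""
    P = [[eta * lam ** (r + 2) if r == c else 0.0 for c in range(M)] for r in range(M)]
    w, u = [0.0] * M, [0.0] * M
    for i in range(len(x)):
        u = [x[i]] + u[:-1]
        Pu = [dot(row, u) for row in P]
        gamma = 1.0 / (1.0 + dot(u, Pu) / lam)
        g = [v * gamma / lam for v in Pu]
        e = d[i] - dot(u, w)
        w = [wk + gk * e for wk, gk in zip(w, g)]
        P = [[P[r][c] / lam - g[r] * g[c] / gamma for c in range(M)] for r in range(M)]
    return w

def fast_array_rls(x, d, M, lam, eta):
    s = math.sqrt(eta * lam)
    L = [[0.0, 0.0] for _ in range(M + 1)]           # L_{-1} as in (37.14)
    L[0][0], L[M][1] = s, s * lam ** (M / 2)
    gs = [0.0] * M          # g_{i-1} * gamma^{-1/2}(i-1)
    igam = 1.0              # gamma^{-1/2}(i-1)
    w, u_prev = [0.0] * M, [0.0] * M
    for i in range(len(x)):
        ext = [x[i]] + u_prev                         # [u(i)  u_{i-1}]
        top = [igam] + [sum(ext[r] * L[r][k] for r in range(M + 1)) for k in range(2)]
        left = [0.0] + gs
        arr = [top] + [[left[r], L[r][0], L[r][1]] for r in range(M + 1)]
        # Givens rotation on columns 0 and 1 (both carry signature +1).
        h = math.hypot(arr[0][0], arr[0][1])
        c, sn = arr[0][0] / h, arr[0][1] / h
        for row in arr:
            row[0], row[1] = c * row[0] + sn * row[1], -sn * row[0] + c * row[1]
        # Hyperbolic rotation on columns 0 (signature +1) and 2 (signature -1).
        rho = arr[0][2] / arr[0][0]
        k = 1.0 / math.sqrt(1.0 - rho * rho)
        for row in arr:
            row[0], row[2] = k * (row[0] - rho * row[2]), k * (row[2] - rho * row[0])
        igam = arr[0][0]                              # gamma^{-1/2}(i)
        gs = [arr[r + 1][0] for r in range(M)]        # g_i * gamma^{-1/2}(i)
        L = [[arr[r + 1][1] / math.sqrt(lam), arr[r + 1][2] / math.sqrt(lam)]
             for r in range(M + 1)]
        u = ext[:M]
        e = d[i] - dot(u, w)
        w = [wk + (gk / igam) * e for wk, gk in zip(w, gs)]
        u_prev = u
    return w

random.seed(1)
M, lam, eta = 3, 0.98, 10.0               # illustrative values
w_true = [0.5, -0.3, 0.2]                 # hypothetical model used to generate d(i)
x = [random.gauss(0.0, 1.0) for _ in range(40)]
d, u = [], [0.0] * M
for i in range(len(x)):
    u = [x[i]] + u[:-1]
    d.append(dot(u, w_true) + 0.01 * random.gauss(0.0, 1.0))

w_fast = fast_array_rls(x, d, M, lam, eta)
w_ref = rls(x, d, M, lam, eta)
gap = max(abs(a - b) for a, b in zip(w_fast, w_ref))
print("largest weight difference:", gap)
```

The run is kept short on purpose: fast least-squares recursions are known to lose accuracy in long finite-precision runs, so agreement over a few dozen iterations says nothing about long-term behavior.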


37.A APPENDIX: CHANDRASEKHAR FILTER 


The fast algorithms of this chapter have connections with a fast alternative to the Kalman filter for 
constant state-space models, known as the Chandrasekhar filter. 

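Thus, refer to the material on Kalman filtering in Chapter 7 and consider again the standard state-
space model (7.8)-(7.10) (say, of order $n$):

$$\begin{aligned}
x_{i+1} &= F_ix_i + G_in_i, \quad i \ge 0 \\
y_i &= H_ix_i + v_i
\end{aligned}$$

$$\mathsf{E}\begin{bmatrix} n_i \\ v_i \\ x_0 \\ 1 \end{bmatrix}\begin{bmatrix} n_j \\ v_j \\ x_0 \end{bmatrix}^* = \begin{bmatrix} Q_i\delta_{ij} & S_i\delta_{ij} & 0 \\ S_i^*\delta_{ij} & R_i\delta_{ij} & 0 \\ 0 & 0 & \Pi_0 \\ 0 & 0 & 0 \end{bmatrix} \tag{37.21}$$

We assume, without loss of generality, that $S_i = 0$, i.e., that the processes $\{n_i, v_i\}$ are uncorrelated.
This case is sufficient for our purposes here.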

The Kalman filter recursions, in covariance form, for computing the innovations and, subsequently,
predicting the state vector are given by (cf. Alg. 7.1):

$$\begin{aligned}
R_{e,i} &= R_i + H_iP_{i|i-1}H_i^* \\
K_{p,i} &= F_iP_{i|i-1}H_i^*R_{e,i}^{-1} \\
\nu_i &= y_i - H_i\hat{x}_{i|i-1} \\
\hat{x}_{i+1|i} &= F_i\hat{x}_{i|i-1} + K_{p,i}\nu_i \\
P_{i+1|i} &= F_iP_{i|i-1}F_i^* + G_iQ_iG_i^* - K_{p,i}R_{e,i}K_{p,i}^*
\end{aligned} \tag{37.22}$$

with initial conditions $\hat{x}_{0|-1} = 0$ and $P_{0|-1} = \Pi_0$. Again, for convenience of exposition, we are
denoting the innovations variable of the Kalman filter by $\nu_i$ in order to avoid conflict of notation
with the symbol $e(i)$ used to denote the output error of RLS.


Array Chandrasekhar filter for constant models. We present first the array form of the 
Chandrasekhar filter. Its derivation relies on the same kind of arguments used in the body of the 
chapter while deriving the fast array RLS method. For this reason, we shall be brief. Once the array 
method is presented, the reader will be able to recognize the close similarity between the RLS and 
Kalman filtering domains (by pursuing Prob. IX.16). 

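So assume initially that the model parameters are constant, say, $\{F, G, H, Q, R\}$, and introduce
the factorization $P_{i+1|i} - P_{i|i-1} = L_{i|i-1}SL_{i|i-1}^*$, where $L_{i|i-1}$ is $n \times \alpha$, $S$ is an $\alpha \times \alpha$ signature
matrix with as many $\pm 1$'s as $(P_{i+1|i} - P_{i|i-1})$ has positive and negative eigenvalues. Moreover,
$\alpha = \mathrm{rank}(P_{i+1|i} - P_{i|i-1})$. The array algorithm follows by forming the pre-array

$$\begin{bmatrix} R_{e,i}^{1/2} & HL_{i|i-1} \\ K_{p,i}R_{e,i}^{1/2} & FL_{i|i-1} \end{bmatrix}$$

and triangularizing it as

$$\begin{bmatrix} R_{e,i}^{1/2} & HL_{i|i-1} \\ K_{p,i}R_{e,i}^{1/2} & FL_{i|i-1} \end{bmatrix}\Theta_i = \begin{bmatrix} X & 0 \\ Y & Z \end{bmatrix} \tag{37.23}$$

for some $\Theta_i$ such that

$$\Theta_i\begin{bmatrix} I & 0 \\ 0 & S \end{bmatrix}\Theta_i^* = \begin{bmatrix} I & 0 \\ 0 & S \end{bmatrix}$$

We can identify the $\{X, Y, Z\}$ terms by comparing entries on both sides of the equality

$$\begin{bmatrix} R_{e,i}^{1/2} & HL_{i|i-1} \\ K_{p,i}R_{e,i}^{1/2} & FL_{i|i-1} \end{bmatrix}\underbrace{\Theta_i\begin{bmatrix} I & 0 \\ 0 & S \end{bmatrix}\Theta_i^*}_{(I \oplus S)}\begin{bmatrix} R_{e,i}^{1/2} & HL_{i|i-1} \\ K_{p,i}R_{e,i}^{1/2} & FL_{i|i-1} \end{bmatrix}^* = \begin{bmatrix} X & 0 \\ Y & Z \end{bmatrix}\begin{bmatrix} I & 0 \\ 0 & S \end{bmatrix}\begin{bmatrix} X & 0 \\ Y & Z \end{bmatrix}^*$$

which leads to

$$\begin{bmatrix} R_{e,i}^{1/2} & HL_{i|i-1} \\ K_{p,i}R_{e,i}^{1/2} & FL_{i|i-1} \end{bmatrix}\Theta_i = \begin{bmatrix} R_{e,i+1}^{1/2} & 0 \\ K_{p,i+1}R_{e,i+1}^{1/2} & L_{i+1|i} \end{bmatrix} \tag{37.24}$$

where $\Theta_i$ is any $(I \oplus S)$-unitary matrix that produces the block zero entry in the post-array. In other
words, this array form propagates the low-rank factor $\{L_{i|i-1}\}$ as well as the gain matrix $\{K_{p,i}\}$.
The algorithm does not involve computation or propagation of the Riccati variable $P_{i|i-1}$, which
therefore results in an order of magnitude improvement in complexity. While the covariance form
of the Kalman filter requires $O(n^3)$ operations per iteration, the above array form requires $O(\alpha n^2)$
operations per iteration and $\alpha \ll n$ usually.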


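The initial conditions are $R_{e,0} = R + H\Pi_0H^*$ and $K_{p,0}R_{e,0}^{1/2} = F\Pi_0H^*R_{e,0}^{-*/2}$, with $\{L_{0|-1}, S\}$
obtained from the factorization

$$P_{1|0} - \Pi_0 = \left[F\Pi_0F^* + GQG^* - K_{p,0}R_{e,0}K_{p,0}^* - \Pi_0\right] = L_{0|-1}SL_{0|-1}^*$$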


Array Chandrasekhar filter for structured models. The connection between the Chan- 
drasekhar filter and the fast RLS algorithms is more natural to establish by developing an extended 
version of the Chandrasekhar recursions; one that applies to a class of non-constant but structured 
state-space models. The need for this extension can be seen by examining the state-space model 
(31.8) that is associated with RLS. In this model, the regressor u; plays the role of the matrix Hi 
in model (37.21). However, although the regressor is time-variant, it nevertheless varies in a struc- 
tured manner. For example, in tapped-delay-line implementations, the regressors will possess shift 
structure. 


To handle such structured situations, we define a class of structured models as follows. Returning
to (37.21), we shall say that the model is structured if there exist $n \times n$ matrices $\{\Psi_i\}$ such that the
parameters $\{F_i, H_i, G_i\}$ vary according to the rule:

$$H_i = H_{i+1}\Psi_i, \qquad F_{i+1}\Psi_i = \Psi_{i+1}F_i, \qquad G_{i+1} = \Psi_{i+1}G_i \tag{37.25}$$

We continue to assume that the covariance matrices $\{R_i, Q_i\}$ are constant, although extensions are
possible. Now introduce the factorization

$$P_{i+1|i} - \Psi_iP_{i|i-1}\Psi_i^* = L_{i|i-1}SL_{i|i-1}^*$$

Then a fast array algorithm can be derived in much the same way as we did before for (37.24). We
omit the details and state the final extended Chandrasekhar recursions:

$$\begin{bmatrix} R_{e,i}^{1/2} & H_{i+1}L_{i|i-1} \\ \Psi_{i+1}K_{p,i}R_{e,i}^{1/2} & F_{i+1}L_{i|i-1} \end{bmatrix}\Theta_i = \begin{bmatrix} R_{e,i+1}^{1/2} & 0 \\ K_{p,i+1}R_{e,i+1}^{1/2} & L_{i+1|i} \end{bmatrix} \tag{37.26}$$

where $\Theta_i$ is any $(I \oplus S)$-unitary matrix that lower triangularizes the pre-array. Moreover, $\{L_{0|-1}, S\}$
are found from the factorization

$$P_{1|0} - \Psi_0P_{0|-1}\Psi_0^* = \left(F_0\Pi_0F_0^* + G_0QG_0^* - K_{p,0}R_{e,0}K_{p,0}^*\right) - \Psi_0\Pi_0\Psi_0^* = L_{0|-1}SL_{0|-1}^*$$
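Chandrasekhar filter in explicit form. Finally, we shall only state that the Chandrasekhar
filter can be expressed in explicit form as follows. We write $K_{p,i} = K_iR_{e,i}^{-1}$, and generate $\{K_i, R_{e,i}\}$
via recursions involving certain auxiliary sequences $\{L_{i|i-1}, R_{r,i}\}$:

$$\begin{aligned}
K_{i+1} &= K_i - FL_{i|i-1}R_{r,i}^{-1}L_{i|i-1}^*H^*, \quad i \ge 0 \\
L_{i+1|i} &= (F - K_iR_{e,i}^{-1}H)L_{i|i-1} \\
R_{e,i+1} &= R_{e,i} - HL_{i|i-1}R_{r,i}^{-1}L_{i|i-1}^*H^* \\
R_{r,i+1} &= R_{r,i} - L_{i|i-1}^*H^*R_{e,i}^{-1}HL_{i|i-1}
\end{aligned} \tag{37.27}$$

The variables $\{L_{i|i-1}, R_{r,i}\}$ have the following interpretation:

$$P_{i+1|i} - P_{i|i-1} = -L_{i|i-1}R_{r,i}^{-1}L_{i|i-1}^*$$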
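For a scalar constant model the explicit recursions can be compared directly with the Riccati-based covariance form. The sketch below assumes real scalar parameters with illustrative values, and initializes $L_{0|-1}$ and $R_{r,0}$ from the sign and magnitude of $P_{1|0} - \Pi_0$ (one valid choice among several); the variable names are hypothetical. It records $\{R_{e,i}, K_{p,i}\}$ from both forms over ten steps.

```python
import math

F, G, H, Q, R, PI0 = 0.9, 1.0, 1.0, 0.5, 1.0, 2.0   # illustrative scalar model

# Covariance (Riccati) form of the Kalman filter.
P, riccati = PI0, []
for _ in range(10):
    Re = R + H * P * H
    Kp = F * P * H / Re
    riccati.append((Re, Kp))
    P = F * P * F + G * Q * G - Kp * Re * Kp

# Explicit Chandrasekhar form, with P_{1|0} - Pi_0 = -L_0 * Rr_0^{-1} * L_0.
Re = R + H * PI0 * H
K = F * PI0 * H                         # K_0 = K_{p,0} * R_{e,0}
delta = F * PI0 * F + G * Q * G - K * K / Re - PI0
L = math.sqrt(abs(delta))
Rr = -1.0 if delta > 0 else 1.0         # assumes delta is nonzero
chandra = []
for _ in range(10):
    chandra.append((Re, K / Re))
    K_next = K - F * L * L * H / Rr
    L_next = (F - K * H / Re) * L
    Re_next = Re - H * L * L * H / Rr
    Rr_next = Rr - L * H * H * L / Re
    K, L, Re, Rr = K_next, L_next, Re_next, Rr_next

gap = max(abs(a - b) for r, c in zip(riccati, chandra) for a, b in zip(r, c))
print("largest difference in (R_e, K_p):", gap)
```

In the scalar case there is no complexity advantage, since $\alpha = n = 1$; the example only illustrates that the recursions never form $P_{i|i-1}$.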


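For structured state-space models we have

$$\begin{aligned}
K_{i+1} &= \Psi_{i+1}K_i - F_{i+1}L_{i|i-1}R_{r,i}^{-1}L_{i|i-1}^*H_{i+1}^* \\
L_{i+1|i} &= F_{i+1}L_{i|i-1} - \Psi_{i+1}K_iR_{e,i}^{-1}H_{i+1}L_{i|i-1} \\
R_{e,i+1} &= R_{e,i} - H_{i+1}L_{i|i-1}R_{r,i}^{-1}L_{i|i-1}^*H_{i+1}^* \\
R_{r,i+1} &= R_{r,i} - L_{i|i-1}^*H_{i+1}^*R_{e,i}^{-1}H_{i+1}L_{i|i-1}
\end{aligned} \tag{37.28}$$

where now $\{L_{i|i-1}, R_{r,i}\}$ have the following interpretation:

$$P_{i+1|i} - \Psi_iP_{i|i-1}\Psi_i^* = -L_{i|i-1}R_{r,i}^{-1}L_{i|i-1}^*$$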


CHAPTER 38


Regularized Prediction Problems 


The derivation of the fast array algorithm in Chapter 37 was based on the realization that,
by proper choice of the regularization matrix $\Pi$ as in (37.13), the successive differences
$\delta P_{i-1}$ in (37.15) will have rank 2 with a constant signature matrix, $S = (1 \oplus -1)$, and with
two-column factors $L_{i-1}$. In this chapter, we provide an interpretation for the columns of
$L_{i-1}$. In the process of doing so, we shall arrive at other efficient implementations of RLS.
These implementations will not be in array form, but in terms of explicit sets of equations;
they are known as the fast Kalman filter, the fast a posteriori error sequential technique
(FAEST), and the fast transversal filter (FTF).

First, however, to facilitate the presentation, we need to adopt a more explicit notation
in order to indicate the fact that the $\{P_i, P_{i-1}\}$ are $M \times M$ matrices. For this reason, we
shall write $P_{M,i}$ and $P_{M,i-1}$ instead of $P_i$ and $P_{i-1}$. We shall also write $u_{M,i}$ instead
of $u_i$, $\Pi_M$ instead of $\Pi$, $w_{M,i}$ instead of $w_i$, $\gamma_M(i)$ instead of $\gamma(i)$, and $g_{M,i}$ instead of
$g_i$. The subscript $M$ in all these variables is used to indicate the order of the underlying
estimation problem, i.e., the dimension of the regression vectors $\{u_{M,i}\}$.

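Thus, consider again the regularized least-squares problem

$$\min_{w_M}\left[\lambda^{(i+1)}w_M^*\Pi_Mw_M + \sum_{j=0}^{i}\lambda^{i-j}|d(j) - u_{M,j}w_M|^2\right] \tag{38.1}$$

with regularization matrix

$$\Pi_M = \eta^{-1} \cdot \mathrm{diag}\{\lambda^{-2}, \lambda^{-3}, \ldots, \lambda^{-(M+1)}\}, \qquad \eta > 0$$

and with data up to time $i$. We assume $\bar{w} = 0$ from now on. The corresponding data matrix
and measurement vector are denoted by

$$H_{M,i} = \begin{bmatrix} u_{M,0} \\ u_{M,1} \\ \vdots \\ u_{M,i} \end{bmatrix}, \qquad y_i = \begin{bmatrix} d(0) \\ d(1) \\ \vdots \\ d(i) \end{bmatrix} \tag{38.2}$$

with a subscript $M$ also attached to the data matrix in order to indicate its column dimension.
The weight vector solution is given by (cf. (30.28))

$$w_{M,i} = P_{M,i}H_{M,i}^*\Lambda_iy_i$$

where the inverse coefficient matrix $P_{M,i}$ is given by

$$P_{M,i} = \left[\lambda^{(i+1)}\Pi_M + H_{M,i}^*\Lambda_iH_{M,i}\right]^{-1}$$

and

$$\Lambda_i = \mathrm{diag}\{\lambda^i, \lambda^{i-1}, \ldots, \lambda, 1\}$$

In addition, the conversion factor at time $i$ is given by

$$\gamma_M(i) = 1 - u_{M,i}P_{M,i}u_{M,i}^* \tag{38.3}$$

and the gain vector is

$$g_{M,i} = P_{M,i}u_{M,i}^*$$

Our objective is to examine the meaning of the low-rank factor $L_i$. For this purpose, we
shall derive an explicit expression for the difference

$$\begin{bmatrix} P_{M,i} & 0 \\ 0 & 0 \end{bmatrix} - \begin{bmatrix} 0 & 0 \\ 0 & P_{M,i-1} \end{bmatrix} \tag{38.4}$$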


The derivation requires that we introduce the so-called backward and forward prediction 
problems. 


38.1 REGULARIZED BACKWARD PREDICTION 


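Problem (38.1) projects, in a regularized and weighted manner, the measurement vector $y_i$
onto $\mathcal{R}(H_{M,i})$, where the data matrix $H_{M,i}$ is given by (say, for $M = 4$):

$$H_{4,i} = \begin{bmatrix}
u(0) & 0 & 0 & 0 \\
u(1) & u(0) & 0 & 0 \\
u(2) & u(1) & u(0) & 0 \\
u(3) & u(2) & u(1) & u(0) \\
u(4) & u(3) & u(2) & u(1) \\
\vdots & \vdots & \vdots & \vdots \\
u(i) & u(i-1) & u(i-2) & u(i-3)
\end{bmatrix}$$

Now assume that we augment $H_{M,i}$ by one column to the right and define

$$H_{M+1,i} = \begin{bmatrix} H_{M,i} & h \end{bmatrix} \tag{38.5}$$

e.g., for $M = 4$,

$$H_{5,i} = \begin{bmatrix}
u(0) & 0 & 0 & 0 & 0 \\
u(1) & u(0) & 0 & 0 & 0 \\
u(2) & u(1) & u(0) & 0 & 0 \\
u(3) & u(2) & u(1) & u(0) & 0 \\
u(4) & u(3) & u(2) & u(1) & u(0) \\
u(5) & u(4) & u(3) & u(2) & u(1) \\
\vdots & \vdots & \vdots & \vdots & \vdots \\
u(i) & u(i-1) & u(i-2) & u(i-3) & u(i-4)
\end{bmatrix}$$

where the entries of the additional column $h$ are

$$h = \mathrm{col}\{0, \ldots, 0, u(0), u(1), \ldots, u(i-M)\} \tag{38.6}$$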


FIGURE 38.1 Projection of $y_i$ onto $\mathcal{R}(H_{M,i})$ and $\mathcal{R}(H_{M+1,i})$ with the corresponding inverse
coefficient matrices $\{P_{M,i}, P_{M+1,i}\}$.


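We can then consider the problem of projecting $y_i$ onto the extended range space $\mathcal{R}(H_{M+1,i})$,
as shown in Fig. 38.1.
This step requires that we solve the following estimation problem of order $M + 1$:

$$\min_{w_{M+1}}\left[\lambda^{(i+1)}w_{M+1}^*\Pi_{M+1}w_{M+1} + \sum_{j=0}^{i}\lambda^{i-j}|d(j) - u_{M+1,j}w_{M+1}|^2\right] \tag{38.7}$$

Its solution is given by

$$w_{M+1,i} = P_{M+1,i}H_{M+1,i}^*\Lambda_iy_i$$

with

$$P_{M+1,i} = \left[\lambda^{(i+1)}\Pi_{M+1} + H_{M+1,i}^*\Lambda_iH_{M+1,i}\right]^{-1} \tag{38.8}$$

Since

$$\Pi_{M+1} = \left(\Pi_M \oplus \eta^{-1}\lambda^{-(M+2)}\right)$$

the regularization matrix $\lambda^{(i+1)}\Pi_{M+1}$ is seen to satisfy

$$\lambda^{(i+1)}\Pi_{M+1} = \begin{bmatrix} \lambda^{(i+1)}\Pi_M & 0 \\ 0 & \eta^{-1}\lambda^{i-M-1} \end{bmatrix}$$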


Comparing problems (38.1) and (38.7) we see that they are of the same form as the order- 
update problems (32.1) and (32.7) studied in Sec. 32.1 with the identifications 


We can therefore invoke the result of Lemma 32.1 to relate $\{w_{M,i}, w_{M+1,i}\}$ as well as
$\{P_{M,i}, P_{M+1,i}\}$. For our purposes, we are more interested in the latter relation, which
from Lemma 32.1 is given by

$$P_{M+1,i} = \begin{bmatrix} P_{M,i} & 0 \\ 0 & 0 \end{bmatrix} + \frac{1}{\eta^{-1}\lambda^{i-M-1} + \xi_M^b(i)}\begin{bmatrix} -w_{M,i}^b \\ 1 \end{bmatrix}\begin{bmatrix} -w_{M,i}^{b*} & 1 \end{bmatrix} \tag{38.9}$$

where $w_{M,i}^b$ is the weight vector that projects $h$ in (38.6) onto $\mathcal{R}(H_{M,i})$, namely it solves

$$\min_{w_M^b}\left[\lambda^{(i+1)}w_M^{b*}\Pi_Mw_M^b + (h - H_{M,i}w_M^b)^*\Lambda_i(h - H_{M,i}w_M^b)\right] \tag{38.10}$$

and $\xi_M^b(i)$ is the corresponding minimum cost. The cost function (38.10) can be rewritten
as

$$\min_{w_M^b}\left[\lambda^{(i+1)}w_M^{b*}\Pi_Mw_M^b + \sum_{j=0}^{i}\lambda^{i-j}|u(j-M) - u_{M,j}w_M^b|^2\right] \tag{38.11}$$

which can be interpreted as estimating each entry $u(j-M)$ from the future values in $u_{M,j}$,

$$u_{M,j} = \begin{bmatrix} u(j) & u(j-1) & \ldots & u(j-M+1) \end{bmatrix}$$

Hence, the use of the superscript $b$ to indicate a backward prediction problem. Let

$$b_M(i) \triangleq u(i-M) - u_{M,i}w_{M,i}^b$$

denote the backward error that results from estimating $u(i-M)$ from $u_{M,i}$; it corresponds
to the last entry of the residual vector $h - H_{M,i}w_{M,i}^b$. By invoking again Lemma 32.1 with
the identifications 


Yz — 1M a (i), y| vu), à — by (i) 


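we can relate the conversion factors $\{\gamma_M(i), \gamma_{M+1}(i)\}$ of problems (38.1) and (38.7) as

$$\gamma_{M+1}(i) = \gamma_M(i) - \frac{|b_M(i)|^2}{\eta^{-1}\lambda^{i-M-1} + \xi_M^b(i)} \tag{38.12}$$

This expression provides an order-update relation for the conversion factor.
If we further multiply (38.9) from the right by

$$u_{M+1,i}^* = \begin{bmatrix} u_{M,i}^* \\ u^*(i-M) \end{bmatrix}$$

then we obtain an order-update relation for the gain vector as well:

$$g_{M+1,i} = \begin{bmatrix} g_{M,i} \\ 0 \end{bmatrix} + \frac{b_M^*(i)}{\eta^{-1}\lambda^{i-M-1} + \xi_M^b(i)}\begin{bmatrix} -w_{M,i}^b \\ 1 \end{bmatrix} \tag{38.13}$$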


38.2 REGULARIZED FORWARD PREDICTION 


Our second step toward evaluating the difference (38.4) is to derive both order- and time-
update relations for the variables $\{\gamma_M(i-1), g_{M,i-1}, P_{M,i-1}\}$. That is, we now show how
to go from these variables to $\{\gamma_{M+1}(i), g_{M+1,i}, P_{M+1,i}\}$.

Thus, consider again the matrix $H_{M+1,i}$ in (38.5) but now partition it as

$$H_{M+1,i} = \begin{bmatrix} h & \begin{matrix} 0 \\ H_{M,i-1} \end{matrix} \end{bmatrix} \tag{38.14}$$

That is, we now separate its leading column (as opposed to its trailing column in (38.5))
from the remaining columns. The remaining columns have a top zero row, which is represented
by the zero entry in (38.14). The entries of $h$ are

$$h = \mathrm{col}\{u(0), u(1), u(2), \ldots, u(i)\} \tag{38.15}$$

We then consider the problem of projecting the same vector $y_i$ onto the range space of

$$\begin{bmatrix} 0 \\ H_{M,i-1} \end{bmatrix}$$

as shown in Fig. 38.2. This is a projection problem of order $M$ and it corresponds to
solving

$$\min_{w_M}\left[\lambda^{i}w_M^*\Pi_Mw_M + \sum_{j=0}^{i}\lambda^{i-j}|d(j) - u_{M,j-1}w_M|^2\right] \tag{38.16}$$

where $u_{M,-1} = 0$. We denote its solution by $w_{M,i-1}$ and it is given by

$$w_{M,i-1} = P_{M,i-1}H_{M,i-1}^*\Lambda_{i-1}\,\mathrm{col}\{d(1), d(2), \ldots, d(i)\}$$

where

$$P_{M,i-1} = \left[\lambda^{i}\Pi_M + H_{M,i-1}^*\Lambda_{i-1}H_{M,i-1}\right]^{-1} \tag{38.17}$$


Comparing problems (38.7) and (38.16) we see that we face the same situation studied 
in Sec. 32.2 on order-updates for forward prediction problems; just compare with problems 
(32.1) and (32.39) with the identifications: 


Y — Vi-i H e Hmi- WA 
II — A'IIy gent heh 
P, — Pu P e Pmi- 


FIGURE 38.2 Projection of $y_i$ onto $\mathcal{R}(\mathrm{col}\{0, H_{M,i-1}\})$ and $\mathcal{R}(H_{M+1,i})$ with the corresponding
inverse coefficient matrices $\{P_{M,i-1}, P_{M+1,i}\}$.
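We can then invoke the result of Lemma 32.2 to relate $\{P_{M,i-1}, P_{M+1,i}\}$, namely,

$$P_{M+1,i} = \begin{bmatrix} 0 & 0 \\ 0 & P_{M,i-1} \end{bmatrix} + \frac{1}{\eta^{-1}\lambda^{i-1} + \xi_M^f(i)}\begin{bmatrix} 1 \\ -w_{M,i}^f \end{bmatrix}\begin{bmatrix} 1 & -w_{M,i}^{f*} \end{bmatrix} \tag{38.18}$$

where $w_{M,i}^f$ is the weight vector that projects $h$ in (38.15) onto $\mathcal{R}(\mathrm{col}\{0, H_{M,i-1}\})$, namely
it solves

$$\min_{w_M^f}\left[\lambda^{i}w_M^{f*}\Pi_Mw_M^f + \left(h - \begin{bmatrix} 0 \\ H_{M,i-1} \end{bmatrix}w_M^f\right)^*\Lambda_i\left(h - \begin{bmatrix} 0 \\ H_{M,i-1} \end{bmatrix}w_M^f\right)\right] \tag{38.19}$$

and $\xi_M^f(i)$ is the corresponding minimum cost. The cost function (38.19) can be rewritten
as

$$\min_{w_M^f}\left[\lambda^{i}w_M^{f*}\Pi_Mw_M^f + \sum_{j=0}^{i}\lambda^{i-j}|u(j) - u_{M,j-1}w_M^f|^2\right] \tag{38.20}$$

which, since $u_{M,-1} = 0$, is also equivalent (up to the additive constant $\lambda^i|u(0)|^2$, which does not depend on $w_M^f$) to

$$\min_{w_M^f}\left[\lambda^{i}w_M^{f*}\Pi_Mw_M^f + \sum_{j=0}^{i-1}\lambda^{i-j-1}|u(j+1) - u_{M,j}w_M^f|^2\right] \tag{38.21}$$

This problem can be interpreted as estimating each entry $u(j+1)$ from its past values in
$u_{M,j}$:

$$\begin{bmatrix} u(j) & u(j-1) & \ldots & u(j-M+1) \end{bmatrix} = u_{M,j}$$

Hence, the use of the superscript $f$ to indicate a forward prediction problem. Let

$$f_M(i) \triangleq u(i) - u_{M,i-1}w_{M,i}^f$$

denote the forward error that results from estimating $u(i)$ from $u_{M,i-1}$; it corresponds to
the last entry of the residual vector

$$h - \begin{bmatrix} 0 \\ H_{M,i-1} \end{bmatrix}w_{M,i}^f$$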

By invoking again Lemma 32.2, with the identifications 
Ye — m+ (i), y — (i — 1), à — fuli) 


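we can relate the conversion factors $\{\gamma_{M+1}(i), \gamma_M(i-1)\}$ of problems (38.7) and (38.16)
as

$$\gamma_{M+1}(i) = \gamma_M(i-1) - \frac{|f_M(i)|^2}{\eta^{-1}\lambda^{i-1} + \xi_M^f(i)} \tag{38.22}$$

If we further multiply (38.18) from the right by $u_{M+1,i}^* = \mathrm{col}\{u^*(i), u_{M,i-1}^*\}$,
we obtain a time- and order-update relation for the gain vector as well:

$$g_{M+1,i} = \begin{bmatrix} 0 \\ g_{M,i-1} \end{bmatrix} + \frac{f_M^*(i)}{\eta^{-1}\lambda^{i-1} + \xi_M^f(i)}\begin{bmatrix} 1 \\ -w_{M,i}^f \end{bmatrix} \tag{38.23}$$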
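
38.3 LOW-RANK FACTORIZATION


Now subtracting (38.9) and (38.18) we find that

$$\begin{bmatrix} P_{M,i} & 0 \\ 0 & 0 \end{bmatrix} - \begin{bmatrix} 0 & 0 \\ 0 & P_{M,i-1} \end{bmatrix} = \frac{1}{\eta^{-1}\lambda^{i-1} + \xi_M^f(i)}\begin{bmatrix} 1 \\ -w_{M,i}^f \end{bmatrix}\begin{bmatrix} 1 & -w_{M,i}^{f*} \end{bmatrix} - \frac{1}{\eta^{-1}\lambda^{i-M-1} + \xi_M^b(i)}\begin{bmatrix} -w_{M,i}^b \\ 1 \end{bmatrix}\begin{bmatrix} -w_{M,i}^{b*} & 1 \end{bmatrix} \tag{38.24}$$

The coefficient matrices $\{P_{M,i}, P_{M,i-1}\}$ that appear in the above expression have the same
order $M$. Dropping $M$, we can rewrite the above relation as

$$\begin{bmatrix} P_i & 0 \\ 0 & 0 \end{bmatrix} - \begin{bmatrix} 0 & 0 \\ 0 & P_{i-1} \end{bmatrix} = \bar{L}_iT_i\bar{L}_i^* \tag{38.25}$$

where the factors $\{\bar{L}_i, T_i\}$ are defined by

$$\bar{L}_i = \begin{bmatrix} 1 & -w_i^b \\ -w_i^f & 1 \end{bmatrix}, \qquad T_i = \mathrm{diag}\left\{\frac{1}{\eta^{-1}\lambda^{i-1} + \xi^f(i)},\ \frac{-1}{\eta^{-1}\lambda^{i-M-1} + \xi^b(i)}\right\} \tag{38.26}$$

Alternatively, we can write (38.25) as

$$\begin{bmatrix} P_i & 0 \\ 0 & 0 \end{bmatrix} - \begin{bmatrix} 0 & 0 \\ 0 & P_{i-1} \end{bmatrix} = \lambda\, L_iSL_i^* \tag{38.27}$$

where $S = (1 \oplus -1)$ and the columns of $L_i$ are given by

$$\text{first column of } L_i = \frac{1}{\sqrt{\eta^{-1}\lambda^{i} + \lambda\xi_M^f(i)}}\begin{bmatrix} 1 \\ -w_{M,i}^f \end{bmatrix}$$

$$\text{second column of } L_i = \frac{1}{\sqrt{\eta^{-1}\lambda^{i-M} + \lambda\xi_M^b(i)}}\begin{bmatrix} -w_{M,i}^b \\ 1 \end{bmatrix}$$

We have therefore achieved our objective of interpreting the columns of $L_i$; they are scaled
multiples of the vectors $\mathrm{col}\{1, -w_{M,i}^f\}$ and $\mathrm{col}\{-w_{M,i}^b, 1\}$, which are formed from the
forward and backward prediction vectors $\{w_{M,i}^f, w_{M,i}^b\}$ that solve (38.21) and (38.11), respectively.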
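The rank-two structure of the difference (38.4) can be confirmed by solving the forward and backward prediction problems directly. The following is a minimal sketch for $M = 2$, assuming real data and illustrative values for $\lambda$, $\eta$, and the input samples; the helper names (`reg`, `inv2`, `P`, `solve`) are hypothetical. It builds $P_{M,i}$ and $P_{M,i-1}$ from their definitions, computes $\{w^f, \xi^f\}$ and $\{w^b, \xi^b\}$ by regularized least squares, and compares both sides of the relation obtained by subtracting (38.9) and (38.18).

```python
LAM, ETA, M = 0.9, 5.0, 2                      # illustrative values
x = [0.7, -1.2, 0.4, 1.5, -0.3, 0.9, -0.8]     # input samples u(0), ..., u(i)
i = len(x) - 1

def sample(j):
    return x[j] if j >= 0 else 0.0

def reg(j):
    """Regressor u_{M,j} = [u(j), u(j-1)], with zeros before time 0."""
    return [sample(j - k) for k in range(M)]

def inv2(A):
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [[A[1][1] / det, -A[0][1] / det], [-A[1][0] / det, A[0][0] / det]]

def P(t):
    """P_{M,t} = [lam^(t+1) Pi_M + sum_j lam^(t-j) u_j^T u_j]^{-1}."""
    A = [[LAM ** (t + 1) * LAM ** -(r + 2) / ETA if r == c else 0.0
          for c in range(M)] for r in range(M)]
    for j in range(t + 1):
        u = reg(j)
        for r in range(M):
            for c in range(M):
                A[r][c] += LAM ** (t - j) * u[r] * u[c]
    return inv2(A)

def solve(Pm, regs, targets):
    """Weighted regularized least squares: returns (weight vector, minimum cost)."""
    wts = [LAM ** (i - j) for j in range(i + 1)]
    rhs = [sum(wt * u[r] * z for wt, u, z in zip(wts, regs, targets)) for r in range(M)]
    w = [sum(Pm[r][c] * rhs[c] for c in range(M)) for r in range(M)]
    cost = sum(wt * z * (z - sum(a * b for a, b in zip(u, w)))
               for wt, u, z in zip(wts, regs, targets))
    return w, cost

Pi_now, Pi_prev = P(i), P(i - 1)
# Backward problem (38.11): estimate u(j-M) from u_{M,j}.
wb, xib = solve(Pi_now, [reg(j) for j in range(i + 1)],
                [sample(j - M) for j in range(i + 1)])
# Forward problem (38.20): estimate u(j) from u_{M,j-1}.
wf, xif = solve(Pi_prev, [reg(j - 1) for j in range(i + 1)],
                [x[j] for j in range(i + 1)])

vf, vb = [1.0] + [-v for v in wf], [-v for v in wb] + [1.0]
kf = 1.0 / (LAM ** (i - 1) / ETA + xif)
kb = 1.0 / (LAM ** (i - M - 1) / ETA + xib)
gap = 0.0
for r in range(M + 1):
    for c in range(M + 1):
        top = Pi_now[r][c] if r < M and c < M else 0.0
        bot = Pi_prev[r - 1][c - 1] if r >= 1 and c >= 1 else 0.0
        gap = max(gap, abs(top - bot - (kf * vf[r] * vf[c] - kb * vb[r] * vb[c])))
print("largest mismatch in the rank-two relation:", gap)
```

The forward term enters with a plus sign and the backward term with a minus sign, which is the origin of the signature $(1 \oplus -1)$.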
Fast Fixed-Order Filters


The arguments so far do not only provide an interpretation for the columns of L; but they 
also motivate other efficient RLS implementations. These alternative implementations are 
not in the form of array algorithms, as was the case in Chapter 37, but in terms of explicit 
sets of recursions. We start with the Fast Transversal Filter (FTF) form. 


39.1 FAST TRANSVERSAL FILTER 


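Define the normalized gain vector

$$c_{M,i} \triangleq g_{M,i}\gamma_M^{-1}(i)$$

and observe that the RLS recursion (37.2) can be used to time-update the weight vectors
$\{w_{M,i}^b, w_{M,i}^f\}$ as follows:

$$w_{M,i}^b = w_{M,i-1}^b + c_{M,i}b_M(i) \tag{39.1}$$

$$w_{M,i}^f = w_{M,i-1}^f + c_{M,i-1}f_M(i) \tag{39.2}$$

where $c_{M,i-1}$ is used in the recursion for $w_{M,i}^f$ instead of $c_{M,i}$, since the forward projection
problem (38.19) is based on $H_{M,i-1}$ while the backward projection problem (38.10) is
based on $H_{M,i}$.

Moreover, writing expression (37.10) in terms of $\{c_{M,i}, c_{M,i-1}\}$, and using the low-
rank factorization (38.25) at time $(i-1)$, we obtain

$$\begin{bmatrix} c_{M,i} \\ 0 \end{bmatrix} = \begin{bmatrix} 0 \\ c_{M,i-1} \end{bmatrix} + \frac{\lambda^{-1}\alpha_M^*(i)}{\eta^{-1}\lambda^{i-2} + \xi_M^f(i-1)}\begin{bmatrix} 1 \\ -w_{M,i-1}^f \end{bmatrix} - \frac{\lambda^{-1}\beta_M^*(i)}{\eta^{-1}\lambda^{i-M-2} + \xi_M^b(i-1)}\begin{bmatrix} -w_{M,i-1}^b \\ 1 \end{bmatrix} \tag{39.3}$$

in terms of the a priori estimation errors

$$\alpha_M(i) = u(i) - u_{M,i-1}w_{M,i-1}^f, \qquad \beta_M(i) = u(i-M) - u_{M,i}w_{M,i-1}^b$$

which are related to the a posteriori errors $\{f_M(i), b_M(i)\}$ via the respective conversion
factors, i.e.,

$$b_M(i) = \gamma_M(i)\beta_M(i) \quad \text{and} \quad f_M(i) = \gamma_M(i-1)\alpha_M(i)$$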


Adaptive Filters, by Ali H. Sayed 
Copyright (c) 2008 John Wiley & Sons, Inc. 


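Now the sum of the first two terms on the right-hand side of (39.3) can be shown to be
equal to $c_{M+1,i}$ (see Prob. IX.6). In this way, the update for $c_{M,i}$ can be split into two
steps. We first perform the time- and order-update:

$$c_{M+1,i} = \begin{bmatrix} 0 \\ c_{M,i-1} \end{bmatrix} + \frac{\lambda^{-1}\alpha_M^*(i)}{\eta^{-1}\lambda^{i-2} + \xi_M^f(i-1)}\begin{bmatrix} 1 \\ -w_{M,i-1}^f \end{bmatrix} \tag{39.4}$$

followed by the order down-date:

$$\begin{bmatrix} c_{M,i} \\ 0 \end{bmatrix} = c_{M+1,i} - \frac{\lambda^{-1}\beta_M^*(i)}{\eta^{-1}\lambda^{i-M-2} + \xi_M^b(i-1)}\begin{bmatrix} -w_{M,i-1}^b \\ 1 \end{bmatrix} \tag{39.5}$$

Using the fact that the last entry on the left-hand side of the above equality is zero, we find
that

$$\text{last entry of } c_{M+1,i} = \frac{\lambda^{-1}\beta_M^*(i)}{\eta^{-1}\lambda^{i-M-2} + \xi_M^b(i-1)}$$

Hence, if we denote this last entry by $\nu_{M+1}(i)$, then $\beta_M(i)$ can be evaluated via the alternative
formula:

$$\beta_M(i) = \lambda\left[\eta^{-1}\lambda^{i-M-2} + \xi_M^b(i-1)\right]\nu_{M+1}^*(i) \tag{39.6}$$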


Moreover, since $\xi_M^f(i)$ denotes the minimum cost of the regularized least-squares problem
(38.19), and since $\xi_M^b(i)$ denotes the minimum cost of the regularized least-squares
problem (38.10), they both satisfy time-update relations of the form (cf. Alg. 30.2):

$$\xi_M^f(i) = \lambda\xi_M^f(i-1) + \alpha_M^*(i)f_M(i), \qquad \xi_M^f(-1) = 0$$

$$\xi_M^b(i) = \lambda\xi_M^b(i-1) + \beta_M^*(i)b_M(i), \qquad \xi_M^b(-1) = 0$$

By collecting the relations we derived so far, along with two easily derived relations for
the conversion factor (see Prob. IX.2), we obtain the so-called fast transversal filter (FTF).
In the listing of the algorithm below, we have introduced the auxiliary variables:

$$\zeta_M^f(i) \triangleq \eta^{-1}\lambda^{i-1} + \xi_M^f(i)$$

$$\zeta_M^b(i) \triangleq \eta^{-1}\lambda^{i-M-1} + \xi_M^b(i)$$

It is easy to see that these variables satisfy similar recursions to those of $\{\xi_M^f(i), \xi_M^b(i)\}$,
namely,

$$\zeta_M^f(i) = \lambda\zeta_M^f(i-1) + \alpha_M^*(i)f_M(i), \qquad \zeta_M^f(-1) = \eta^{-1}\lambda^{-2}$$

$$\zeta_M^b(i) = \lambda\zeta_M^b(i-1) + \beta_M^*(i)b_M(i), \qquad \zeta_M^b(-1) = \eta^{-1}\lambda^{-(M+2)}$$

albeit with different initial conditions.



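Algorithm 39.1 (Fast transversal filter) Consider the same setting as Alg. 37.1
with $\bar{w} = 0$. The solution $w_i$ can be recursively computed as follows. Start
with $u_{M,-1} = 0$, $\gamma_M(-1) = 1$, $c_{M,-1} = 0$, $w_{M,-1}^f = 0$, $w_{M,-1}^b = 0$,
$w_{M,-1} = 0$, $\zeta_M^f(-1) = \eta^{-1}\lambda^{-2}$, $\zeta_M^b(-1) = \eta^{-1}\lambda^{-(M+2)}$, and repeat for
$i \ge 0$:

$$\begin{aligned}
\alpha_M(i) &= u(i) - u_{M,i-1}w_{M,i-1}^f \\
f_M(i) &= \gamma_M(i-1)\alpha_M(i) \\
\zeta_M^f(i) &= \lambda\zeta_M^f(i-1) + \alpha_M^*(i)f_M(i) \\
\gamma_{M+1}(i) &= \gamma_M(i-1)\,\lambda\zeta_M^f(i-1)/\zeta_M^f(i) \\
w_{M,i}^f &= w_{M,i-1}^f + f_M(i)c_{M,i-1} \\
c_{M+1,i} &= \begin{bmatrix} 0 \\ c_{M,i-1} \end{bmatrix} + \frac{\alpha_M^*(i)}{\lambda\zeta_M^f(i-1)}\begin{bmatrix} 1 \\ -w_{M,i-1}^f \end{bmatrix} \\
\nu_{M+1}(i) &= \text{last entry of } c_{M+1,i} \\
\beta_M(i) &= \lambda\zeta_M^b(i-1)\nu_{M+1}^*(i) \\
\gamma_M(i) &= \gamma_{M+1}(i)/\left[1 - \beta_M(i)\gamma_{M+1}(i)\nu_{M+1}(i)\right] \\
\begin{bmatrix} c_{M,i} \\ 0 \end{bmatrix} &= c_{M+1,i} - \nu_{M+1}(i)\begin{bmatrix} -w_{M,i-1}^b \\ 1 \end{bmatrix} \\
b_M(i) &= \gamma_M(i)\beta_M(i) \\
\zeta_M^b(i) &= \lambda\zeta_M^b(i-1) + \beta_M^*(i)b_M(i) \\
w_{M,i}^b &= w_{M,i-1}^b + b_M(i)c_{M,i} \\
e(i) &= d(i) - u_{M,i}w_{i-1} \\
r(i) &= \gamma_M(i)e(i) \\
w_i &= w_{i-1} + r(i)c_{M,i}
\end{aligned}$$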
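The FTF recursions can be compared with the standard RLS recursion on a short data record. The sketch below assumes real data, $\bar{w} = 0$, and illustrative values for $M$, $\lambda$, $\eta$, and a hypothetical test model `w_true`; the function names are not from the text. It follows the order of operations in the listing, with the normalized gain update computed before the forward predictor is overwritten.

```python
import random

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def rls(x, d, M, lam, eta):
    """Reference RLS with P_{-1} = eta * diag(lam^2, ..., lam^(M+1))."""
    P = [[eta * lam ** (r + 2) if r == c else 0.0 for c in range(M)] for r in range(M)]
    w, u = [0.0] * M, [0.0] * M
    for i in range(len(x)):
        u = [x[i]] + u[:-1]
        Pu = [dot(row, u) for row in P]
        gamma = 1.0 / (1.0 + dot(u, Pu) / lam)
        g = [v * gamma / lam for v in Pu]
        e = d[i] - dot(u, w)
        w = [wk + gk * e for wk, gk in zip(w, g)]
        P = [[P[r][c] / lam - g[r] * g[c] / gamma for c in range(M)] for r in range(M)]
    return w

def ftf(x, d, M, lam, eta):
    u_prev = [0.0] * M                    # u_{M,i-1}
    gamma = 1.0                           # gamma_M(i-1)
    c = [0.0] * M                         # c_{M,i-1}
    wf, wb, w = [0.0] * M, [0.0] * M, [0.0] * M
    zf = 1.0 / (eta * lam ** 2)           # zeta^f_M(-1)
    zb = 1.0 / (eta * lam ** (M + 2))     # zeta^b_M(-1)
    for i in range(len(x)):
        u = [x[i]] + u_prev[:-1]
        alpha = x[i] - dot(u_prev, wf)                 # a priori forward error
        f = gamma * alpha                              # a posteriori forward error
        zf_new = lam * zf + alpha * f
        gamma_ext = gamma * lam * zf / zf_new          # gamma_{M+1}(i)
        k = alpha / (lam * zf)
        c_ext = [k] + [ck - k * wk for ck, wk in zip(c, wf)]   # c_{M+1,i}
        wf = [wk + f * ck for wk, ck in zip(wf, c)]
        nu = c_ext[-1]
        beta = lam * zb * nu                           # a priori backward error
        gamma = gamma_ext / (1.0 - beta * gamma_ext * nu)
        c = [c_ext[r] + nu * wb[r] for r in range(M)]  # c_{M,i}
        b = gamma * beta                               # a posteriori backward error
        zb = lam * zb + beta * b
        wb = [wk + b * ck for wk, ck in zip(wb, c)]
        e = d[i] - dot(u, w)
        w = [wk + gamma * e * ck for wk, ck in zip(w, c)]
        zf, u_prev = zf_new, u
    return w

random.seed(2)
M, lam, eta = 3, 0.98, 10.0               # illustrative values
w_true = [0.5, -0.3, 0.2]                 # hypothetical model used to generate d(i)
x = [random.gauss(0.0, 1.0) for _ in range(30)]
d, u = [], [0.0] * M
for i in range(len(x)):
    u = [x[i]] + u[:-1]
    d.append(dot(u, w_true) + 0.01 * random.gauss(0.0, 1.0))

w_ftf = ftf(x, d, M, lam, eta)
w_ref = rls(x, d, M, lam, eta)
gap = max(abs(p - q) for p, q in zip(w_ftf, w_ref))
print("largest weight difference:", gap)
```

Each iteration touches only vectors of length $M$ or $M+1$, in contrast with the $M \times M$ matrix update inside the reference routine.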


39.2 FAEST FILTER 


The fast a posteriori error sequential technique (FAEST) is similar to the FTF implemen- 
tation we described above except that FAEST relies on propagating the inverses of the 
conversion factors, rather than the conversion factors themselves. 

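Thus, observe from (38.22), or from Prob. IX.2, that we can equivalently write

$$\gamma_{M+1}(i) = \gamma_M(i-1)\left[1 - \frac{\alpha_M^*(i)f_M(i)}{\zeta_M^f(i)}\right] = \gamma_M(i-1)\,\frac{\lambda\zeta_M^f(i-1)}{\zeta_M^f(i)}$$

so that

$$\gamma_{M+1}^{-1}(i) = \gamma_M^{-1}(i-1)\,\frac{\zeta_M^f(i)}{\lambda\zeta_M^f(i-1)} = \gamma_M^{-1}(i-1)\,\frac{\lambda\zeta_M^f(i-1) + \alpha_M^*(i)f_M(i)}{\lambda\zeta_M^f(i-1)}$$

and, hence,

$$\gamma_{M+1}^{-1}(i) = \gamma_M^{-1}(i-1) + \frac{|\alpha_M(i)|^2}{\lambda\zeta_M^f(i-1)} \tag{39.7}$$

Likewise, from the relation in part (c) of Prob. IX.2 we get

$$\gamma_M^{-1}(i) = \gamma_{M+1}^{-1}(i) - \beta_M(i)\nu_{M+1}(i)$$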


All other equations in the FTF description remain unchanged. 


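Algorithm 39.2 (FAEST filter) Consider the same setting as Alg. 37.1 with
$\bar{w} = 0$. The solution $w_i$ can be recursively computed as follows. Start with
$u_{M,-1} = 0$, $\gamma_M^{-1}(-1) = 1$, $c_{M,-1} = 0$, $w_{M,-1}^f = 0$, $w_{M,-1}^b = 0$, $w_{M,-1} = 0$,
$\zeta_M^f(-1) = \eta^{-1}\lambda^{-2}$, $\zeta_M^b(-1) = \eta^{-1}\lambda^{-(M+2)}$, and repeat for $i \ge 0$:

$$\begin{aligned}
\alpha_M(i) &= u(i) - u_{M,i-1}w_{M,i-1}^f \\
f_M(i) &= \alpha_M(i)/\gamma_M^{-1}(i-1) \\
\zeta_M^f(i) &= \lambda\zeta_M^f(i-1) + \alpha_M^*(i)f_M(i) \\
\gamma_{M+1}^{-1}(i) &= \gamma_M^{-1}(i-1) + |\alpha_M(i)|^2/\left[\lambda\zeta_M^f(i-1)\right] \\
w_{M,i}^f &= w_{M,i-1}^f + f_M(i)c_{M,i-1} \\
c_{M+1,i} &= \begin{bmatrix} 0 \\ c_{M,i-1} \end{bmatrix} + \frac{\alpha_M^*(i)}{\lambda\zeta_M^f(i-1)}\begin{bmatrix} 1 \\ -w_{M,i-1}^f \end{bmatrix} \\
\nu_{M+1}(i) &= \text{last entry of } c_{M+1,i} \\
\beta_M(i) &= \lambda\zeta_M^b(i-1)\nu_{M+1}^*(i) \\
\gamma_M^{-1}(i) &= \gamma_{M+1}^{-1}(i) - \beta_M(i)\nu_{M+1}(i) \\
\begin{bmatrix} c_{M,i} \\ 0 \end{bmatrix} &= c_{M+1,i} - \nu_{M+1}(i)\begin{bmatrix} -w_{M,i-1}^b \\ 1 \end{bmatrix} \\
b_M(i) &= \beta_M(i)/\gamma_M^{-1}(i) \\
\zeta_M^b(i) &= \lambda\zeta_M^b(i-1) + \beta_M^*(i)b_M(i) \\
w_{M,i}^b &= w_{M,i-1}^b + b_M(i)c_{M,i} \\
e(i) &= d(i) - u_{M,i}w_{i-1} \\
r(i) &= e(i)/\gamma_M^{-1}(i) \\
w_i &= w_{i-1} + r(i)c_{M,i}
\end{aligned}$$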
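The only steps in which FAEST differs from FTF are the two conversion-factor updates, and their equivalence is a matter of arithmetic. The sketch below assumes real data and arbitrary illustrative values for the quantities available at one iteration (all variable names and numbers are hypothetical); it evaluates the FTF expressions for $\gamma_{M+1}(i)$ and $\gamma_M(i)$ and the FAEST expressions for their inverses.

```python
lam = 0.98                                   # illustrative forgetting factor
gamma_prev, zeta_f_prev, alpha = 0.8, 2.5, 0.6   # gamma_M(i-1), zeta^f_M(i-1), alpha_M(i)

# FTF route: update the conversion factor itself.
f = gamma_prev * alpha
zeta_f = lam * zeta_f_prev + alpha * f
gamma_ext_ftf = gamma_prev * lam * zeta_f_prev / zeta_f

# FAEST route (39.7): update the inverse of the conversion factor.
inv_gamma_ext = 1.0 / gamma_prev + alpha ** 2 / (lam * zeta_f_prev)

# Order down-date, with nu chosen consistently with beta = lam * zeta^b * nu.
zeta_b_prev, beta = 3.0, 0.4                 # zeta^b_M(i-1), beta_M(i)
nu = beta / (lam * zeta_b_prev)
gamma_ftf = gamma_ext_ftf / (1.0 - beta * gamma_ext_ftf * nu)
inv_gamma_faest = inv_gamma_ext - beta * nu

print(gamma_ext_ftf, 1.0 / inv_gamma_ext)
print(gamma_ftf, 1.0 / inv_gamma_faest)
```

Propagating the inverse replaces the division inside the FTF down-date by a subtraction, at the cost of divisions when the a posteriori errors are formed.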


39.3 FAST KALMAN FILTER 


The fast Kalman filter is another efficient RLS implementation (actually the earliest such
implementation); its equations are in essence similar to what we have done so far for
FTF and FAEST except that they rely on propagating the gain vector $g_{M,i}$ instead of its
normalized version $c_{M,i}$. The update of $g_{M,i}$ is based on equations (38.13) and (38.23), as
we shall explain below. Also, the fast Kalman filter does not use the conversion factors to
evaluate a posteriori errors from a priori errors. All error quantities are evaluated directly
from their definitions. For this reason, FTF and FAEST are more efficient than fast Kalman
(see Table 39.1).

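The time-update of the gain vector, from $g_{M,i-1}$ to $g_{M,i}$, is performed as follows. First,
equation (38.23) is used to update $g_{M,i-1}$ to $g_{M+1,i}$, namely,

$$g_{M+1,i} = \begin{bmatrix} 0 \\ g_{M,i-1} \end{bmatrix} + \frac{f_M^*(i)}{\zeta_M^f(i)}\begin{bmatrix} 1 \\ -w_{M,i}^f \end{bmatrix}$$

Now from (38.13), we see that the last entry of the just computed $g_{M+1,i}$ is equal to
$b_M^*(i)/\zeta_M^b(i)$. We denote this last entry by $\sigma_{M+1}(i)$:

$$\sigma_{M+1}(i) \triangleq \text{last entry of } g_{M+1,i} = b_M^*(i)/\zeta_M^b(i)$$

Then $g_{M,i}$ can be recovered from (38.13) as

$$g_{M,i} = g_{M+1,i}[0 : M-1] + \sigma_{M+1}(i)\,w_{M,i}^b$$

where the notation $g_{M+1,i}[0 : M-1]$ denotes the top $M$ entries of $g_{M+1,i}$. If we further
replace $w_{M,i}^b$ in the above equation by its update

$$w_{M,i}^b = w_{M,i-1}^b + g_{M,i}\beta_M(i)$$

and solve for $g_{M,i}$, we get

$$g_{M,i} = \frac{g_{M+1,i}[0 : M-1] + \sigma_{M+1}(i)\,w_{M,i-1}^b}{1 - \sigma_{M+1}(i)\beta_M(i)}$$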


In this way, we arrive at the statement of the fast Kalman filter. 

For comparison purposes, Table 39.1 lists the estimated computational cost per iteration 
for the fast array method, FTF, FAEST, and fast Kalman assuming real data. The costs are 
in terms of the number of multiplications, additions, and divisions that are needed for each 
iteration. It is seen that FTF and FAEST require O(7M) operations per iteration, while 
fast Kalman requires O(9M) operations per iteration. 


TABLE 39.1 Estimated computational cost per iteration for fast array, FTF, FAEST, and fast
Kalman, assuming real data.

| Algorithm   | ×       | +        | ÷ | √ |
| FAEST       | 7M + 6  | 7M + 2   | 5 | - |
| FTF         | 7M + 10 | 7M + 1   | 3 | - |
| fast Kalman | 9M + 2  | 8M - 1   | 2 | - |
| fast array  | 6M + 4  | 10M + 16 | 6 | 2 |


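Algorithm 39.3 (Fast Kalman filter) Consider the same setting as Alg. 37.1
with $\bar{w} = 0$. The solution $w_i$ can be recursively computed as follows.
Start with $u_{M,-1} = 0$, $g_{M,-1} = 0$, $w_{M,-1}^f = 0$, $w_{M,-1}^b = 0$, $w_{M,-1} = 0$,
$\zeta_M^f(-1) = \eta^{-1}\lambda^{-2}$, and repeat for $i \ge 0$:

$$\begin{aligned}
\alpha_M(i) &= u(i) - u_{M,i-1}w_{M,i-1}^f \\
w_{M,i}^f &= w_{M,i-1}^f + \alpha_M(i)g_{M,i-1} \\
f_M(i) &= u(i) - u_{M,i-1}w_{M,i}^f \\
\zeta_M^f(i) &= \lambda\zeta_M^f(i-1) + \alpha_M^*(i)f_M(i) \\
g_{M+1,i} &= \begin{bmatrix} 0 \\ g_{M,i-1} \end{bmatrix} + \frac{f_M^*(i)}{\zeta_M^f(i)}\begin{bmatrix} 1 \\ -w_{M,i}^f \end{bmatrix} \\
\sigma_{M+1}(i) &= \text{last entry of } g_{M+1,i} \\
\beta_M(i) &= u(i-M) - u_{M,i}w_{M,i-1}^b \\
g_{M,i} &= \frac{g_{M+1,i}[0 : M-1] + \sigma_{M+1}(i)\,w_{M,i-1}^b}{1 - \sigma_{M+1}(i)\beta_M(i)} \\
w_{M,i}^b &= w_{M,i-1}^b + \beta_M(i)g_{M,i} \\
e(i) &= d(i) - u_{M,i}w_{i-1} \\
w_i &= w_{i-1} + g_{M,i}e(i)
\end{aligned}$$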



39.4 STABILITY ISSUES 


Unfortunately, the fast least-squares algorithms of this chapter tend to suffer from nu- 
merical instabilities when implemented in finite precision. This is because the algebraic 
equalities that are used to derive the algorithms tend to break down in finite precision. A 
computer project at the end of this part illustrates some of these instability problems; the 
project also shows that some implementations are more reliable than others. 

Still, there are several ingenious methods that have been proposed over the years in or- 
der to combat the instability problem with varied degrees of success. In this section, we 
describe and comment on some of these techniques. 


Array Implementation 
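We start with the array implementation of Alg. 37.1, namely,

$$\begin{bmatrix} \gamma^{-1/2}(i-1) & \begin{bmatrix} u(i) & u_{i-1} \end{bmatrix} \bar{L}_{i-1} \\ \begin{bmatrix} 0 \\ g_{i-1}\gamma^{-1/2}(i-1) \end{bmatrix} & \bar{L}_{i-1} \end{bmatrix} \Theta_i = \begin{bmatrix} \gamma^{-1/2}(i) & \begin{bmatrix} 0 & 0 \end{bmatrix} \\ \begin{bmatrix} g_i\gamma^{-1/2}(i) \\ 0 \end{bmatrix} & \lambda^{1/2}\bar{L}_i \end{bmatrix} \qquad (39.8)$$

where $\Theta_i$ is $J$-unitary with

$$J = (1 \oplus S) \quad \text{and} \quad S = \mathrm{diag}\{1, -1\}$$

Although from a theoretical point of view, any $\Theta_i$ that produces the zero entries in the
first row of the post-array in (39.8) will do, different implementations of $\Theta_i$ tend to lead
to different numerical behavior. To explain this, consider for illustration purposes the case
$M = 3$. Then the pre- and post-arrays will have the generic forms: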


In order to create the zero pattern in the first row of the post-array, the $J$-unitary
transformation $\Theta_i$ can be constructed based solely on the first row of the pre-array, which
means that only the information that is needed to update $\gamma^{-1/2}(i-1)$ to $\gamma^{-1/2}(i)$
is used to determine $\Theta_i$. In this way, no information from the other equations of the RLS
algorithm (37.2) influences the choice of $\Theta_i$.
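
The first-row construction can be sketched in Python as follows. This is only an illustration under stated assumptions: real data, $M = 3$ (so the arrays have three columns and $J = \mathrm{diag}\{1, 1, -1\}$), and a first row whose leading entry dominates so that the hyperbolic rotation exists. The helper names `givens`, `hyperbolic`, `embed`, and `first_row_transform` are hypothetical.

```python
import numpy as np

def givens(a, b):
    """2x2 circular rotation G such that [a, b] @ G = [r, 0]."""
    r = np.hypot(a, b)
    c, s = a / r, b / r
    return np.array([[c, -s], [s, c]])

def hyperbolic(a, b):
    """2x2 hyperbolic rotation H such that [a, b] @ H = [x, 0]; needs |a| > |b|."""
    rho = b / a
    k = 1.0 / np.sqrt(1.0 - rho ** 2)
    return k * np.array([[1.0, -rho], [-rho, 1.0]])

def embed(R, p, q, n=3):
    """Embed a 2x2 rotation acting on columns p and q into an n x n identity."""
    T = np.eye(n)
    T[np.ix_([p, q], [p, q])] = R
    return T

def first_row_transform(row):
    """J-unitary Theta, built from the first row only, with row @ Theta = [x, 0, 0]."""
    G = embed(givens(row[0], row[1]), 0, 1)        # columns 0, 1 share signature +1
    r = row @ G                                    # now [r, 0, row[2]]
    H = embed(hyperbolic(r[0], r[2]), 0, 2)        # column 2 has signature -1
    return G @ H
```

Applying `first_row_transform` to the first row of any pre-array with the assumed dominance property and multiplying the whole array by the result zeroes entries $(0,1)$ and $(0,2)$ while preserving the $J$-norm relations between rows.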

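An alternative way to construct $\Theta_i$ is as follows. First, we create a zero entry in the first
row of the post-array by means of a circular (Givens) rotation, using instead the entries
(0, 0) and (0, 1) of the pre-array:

$$\begin{bmatrix} \otimes & \otimes & \times \\ \times & \times & \times \\ \times & \times & \times \\ \times & \times & \times \\ \times & \times & \times \end{bmatrix} \xrightarrow{\ \text{Givens}\ } \begin{bmatrix} \times' & \boxed{0} & \times \\ \times' & \times' & \times \\ \times' & \times' & \times \\ \times' & \times' & \times \\ \times' & \times' & \times \end{bmatrix}$$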


Now note that the additional hyperbolic rotation that is needed to zero out the remaining 
entry in the first row of the post-array, should also result in a zero entry in the (M + 1,0) 
position of the post-array. Therefore, rather than determine this hyperbolic rotation by 
using the entries (0, 0) and (0,2) of the first row of the pre-array, we can determine it by 
using the entries (M + 1,0) and (M + 1,2) of the last row of the pre-array. That is, 


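$$\begin{bmatrix} \times' & 0 & \times \\ \times' & \times' & \times \\ \times' & \times' & \times \\ \times' & \times' & \times \\ \otimes & \times' & \otimes \end{bmatrix} \xrightarrow{\ \text{hyperbolic}\ } \begin{bmatrix} \times'' & 0 & 0 \\ \times'' & \times' & \times'' \\ \times'' & \times' & \times'' \\ \times'' & \times' & \times'' \\ \boxed{0} & \times' & \times'' \end{bmatrix}$$


This is a reasonable choice since, in this case, the construction of $\Theta_i$ is affected by other
RLS variables. In the computer project at the end of this part, it is observed that the array
method is more robust to numerical effects than the other (explicit) fast least-squares
variants.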


Rescue Mechanisms 
With regard to the FTF, FAEST, and fast Kalman implementations, there are several
mechanisms to rescue them from slipping into divergence. The main idea in these methods 
is to monitor a certain variable whose value is known theoretically to be positive. When 
the variable becomes negative, the algorithm is restarted as explained below. 

The rescue variable is chosen as

$$\text{rescue} \triangleq 1 - \beta_M(i)\,\gamma_{M+1}(i)\,\nu_{M+1}(i) \quad \text{(for FTF)} \qquad (39.9)$$

which appears in the expression for evaluating $\gamma_M(i)$ from $\gamma_{M+1}(i)$ in the FTF algorithm.
At each iteration, the sign of the rescue variable is checked. If it is positive, the algorithm
continues its flow. Otherwise, if the rescue variable becomes negative at some iteration $i_o$,
the algorithm is restarted for iteration $i_o + 1$ as follows:

$$\begin{aligned}
& w_{i_o} = w_{i_o}, \quad w^f_{M,i_o} = 0, \quad w^b_{M,i_o} = 0 \\
& u_{M,i_o} = 0, \quad c_{M,i_o} = 0, \quad \gamma_M(i_o) = 1 \\
& \zeta^f_M(i_o) = \eta^{-1}\lambda^{-2}, \quad \zeta^b_M(i_o) = \eta^{-1}\lambda^{-M-2}
\end{aligned} \qquad (39.10)$$

That is, all variables are set to their original initial values except for the weight vector $w_{i_o}$,
which is set to the current solution. In a similar manner, for FAEST we choose the rescue
variable as the quantity

$$\text{rescue} \triangleq \gamma^{-1}_{M+1}(i) - \beta_M(i)\,\nu_{M+1}(i) \quad \text{(for FAEST)}$$

whereas for fast Kalman it is

$$\text{rescue} \triangleq 1 - \sigma_{M+1}(i)\,\beta_M(i) \quad \text{(for fast Kalman)}$$
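
The restart logic can be sketched in Python as follows. This is a minimal illustration, not the text's implementation: the state is held in a dictionary with hypothetical keys, the data are assumed real, and the reset values $\eta^{-1}\lambda^{-2}$ and $\eta^{-1}\lambda^{-M-2}$ are the initial conditions assumed for the forward and backward minimum costs.

```python
import numpy as np

def ftf_rescue(state, rescue, M, lam, eta):
    """Return the state to use at the next iteration.

    If the rescue variable is positive the state is returned unchanged;
    otherwise every quantity is reset to its initial value except the
    weight vector, which keeps the current solution.
    """
    if rescue > 0:
        return state
    return {
        "w": state["w"],                          # keep the current weight estimate
        "wf": np.zeros(M),                        # forward predictor
        "wb": np.zeros(M),                        # backward predictor
        "u": np.zeros(M),                         # regressor memory
        "c": np.zeros(M),                         # normalized gain vector
        "gamma": 1.0,                             # conversion factor
        "zeta_f": 1.0 / (eta * lam ** 2),         # forward minimum cost
        "zeta_b": 1.0 / (eta * lam ** (M + 2)),   # backward minimum cost
    }
```

A caller would evaluate the rescue variable for the algorithm in use (FTF, FAEST, or fast Kalman) at every iteration and pass it to this function before proceeding to the next time instant.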
It has been observed in simulations that a more reliable rescue mechanism is perhaps to
keep $\zeta^f_M(i_o)$ at its current value and to re-initialize $\zeta^b_M(i_o)$ by using the expression from
Prob. IX.4, namely,

$$\zeta^b_M(i_o) = \lambda^{-M}\,\gamma_M(i_o)\,\zeta^f_M(i_o)$$

This expression enforces the relation that should hold between $\{\zeta^b_M(i_o), \zeta^f_M(i_o)\}$. In this
way, an alternative rescue mechanism is to restart the algorithm for iteration $i_o + 1$ as
follows:

$$\begin{aligned}
& w_{i_o} = w_{i_o}, \quad w^f_{M,i_o} = 0, \quad w^b_{M,i_o} = 0 \\
& u_{M,i_o} = 0, \quad c_{M,i_o} = 0, \quad \gamma_M(i_o) = 1 \\
& \zeta^b_M(i_o) = \zeta^f_M(i_o)/\lambda^M
\end{aligned} \qquad (39.11)$$


Feedback Stabilization 
A second approach for addressing the stability problem of FTF and FAEST is based on 
introducing redundancy into the computation of certain quantities. In so doing, it becomes 
possible to estimate the numerical errors accumulated in these quantities and to use these 
errors in a feedback mechanism in order to combat destabilizing effects. 

Consider the a priori backward prediction error $\beta_M(i)$. In FTF and FAEST it is
computed by means of the expression

$$\beta_M(i) = \lambda\,\zeta^b_M(i-1)\,\nu^*_{M+1}(i)$$

However, it could also be computed more directly (but also less efficiently) by using its
definition, namely,

$$\bar{\beta}_M(i) = u(i-M) - u_{M,i}\, w^b_{M,i-1}$$

where the bar is used only to distinguish the directly computed value from the recursively
computed one. If we now modify the FTF and FAEST recursions to incorporate the evaluation
of the a priori backward error by means of both of these expressions, then a suitable
combination of their values could be interpreted as a better value for $\beta_M(i)$. Stabilization
by feedback is based on this idea. Specifically, the difference between
$\{\bar{\beta}_M(i), \beta_M(i)\}$ is fed back, scaled by a certain gain $\tau$, to evaluate the a priori
backward error, i.e.,

$$\beta_M(i) + \tau\left[\bar{\beta}_M(i) - \beta_M(i)\right] = \tau\,\bar{\beta}_M(i) + (1-\tau)\,\beta_M(i)$$


The resulting equations are listed in Alg. 39.4; their computational complexity is O(8M)
operations per iteration; the increase from O(7M) to O(8M) is due to the direct evaluation of
$\beta_M(i)$ from its definition. The algorithm is more similar to FAEST than to FTF since it
relies on propagating the inverses of the conversion factors. It is distinct from both in that
it propagates the inverse of $\zeta^f_M(i)$ as well (see Prob. IX.5).

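The algorithm actually computes three convex combinations of the two evaluations of the
a priori backward error, namely the value $\bar{\beta}_M(i)$ obtained directly from the definition
and the value $\beta_M(i)$ obtained recursively, and in this way evaluates three a priori errors:

$$\begin{aligned}
\beta^{(1)}_M(i) &= \tau_1\,\bar{\beta}_M(i) + (1-\tau_1)\,\beta_M(i) \\
\beta^{(2)}_M(i) &= \tau_2\,\bar{\beta}_M(i) + (1-\tau_2)\,\beta_M(i) \\
\beta^{(3)}_M(i) &= \tau_3\,\bar{\beta}_M(i) + (1-\tau_3)\,\beta_M(i)
\end{aligned}$$

Each of these values is used at a different location in the algorithm. The challenge lies in
selecting suitable values for the coefficients $\{\tau_1, \tau_2, \tau_3\}$. Typical values are

$$\{\tau_1, \tau_2, \tau_3\} = \{1.5, 2.5, 1\} \qquad (39.12)$$

but these may need to be adjusted (usually by simulation), because the efficacy of such a
“stabilization” procedure is dependent on signal statistics.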
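
The combination step can be illustrated with a short Python sketch. It assumes, as a reading of the passage above, that each gain multiplies the directly computed error and that the complementary weight multiplies the recursively computed one; the function name `combine_backward_errors` and its argument names are hypothetical.

```python
def combine_backward_errors(beta_direct, beta_recursive, taus=(1.5, 2.5, 1.0)):
    """Return the three combined a priori backward errors.

    Each combination is beta_recursive + tau * (beta_direct - beta_recursive),
    i.e. the recursive value corrected by the fed-back difference.
    """
    return tuple(tau * beta_direct + (1.0 - tau) * beta_recursive for tau in taus)
```

When the two evaluations agree, as they do in exact arithmetic, all three outputs coincide with the common value; the gains only matter once round-off makes the two evaluations differ.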
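Observe further that the algorithm evaluates $\gamma_M(i)$ in two different ways:

$$\gamma^{-1}_M(i) = \gamma^{-1}_{M+1}(i) - \beta^{(3)}_M(i)\,\nu_{M+1}(i)$$

where $\beta^{(3)}_M(i)$ is the third of the combined a priori backward errors, and (cf. Prob. IX.4),

$$\gamma_M(i) = \lambda^M\,\zeta^b_M(i)/\zeta^f_M(i)$$

It also evaluates $\gamma^{-1}_{M+1}(i)$ via

$$\gamma^{-1}_{M+1}(i) = \gamma^{-1}_M(i-1) + \delta_{M+1}(i)\,\alpha_M(i) \qquad (39.13)$$

This recursion follows from (39.7) by noting from the expression for $c_{M+1,i}$ in the
statements of FTF or FAEST that its top entry is equal to $\alpha^*_M(i)/\lambda\zeta^f_M(i-1)$:

$$\delta_{M+1}(i) \triangleq \text{top entry of } c_{M+1,i} = \alpha^*_M(i)/\lambda\zeta^f_M(i-1) \qquad (39.14)$$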


We may finally remark that, despite the qualification "stabilized FTF", this procedure
can still suffer from instability problems (see the computer project at the end of the
chapter). Care needs to be taken in the choice of the scaling factors $\{\tau_1, \tau_2, \tau_3\}$. In
addition, the value of $\lambda$ needs to be sufficiently close to one, usually

$$1 - \frac{1}{2M} < \lambda < 1$$

which limits the performance of the algorithm in nonstationary environments. Of course,
one could also consider incorporating into the algorithm the rescue procedures (39.10) or
(39.11).

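Algorithm 39.4 (Stabilized FTF) Consider the same setting as Alg. 39.1 with the
same initial conditions. Choose combination coefficients $\{\tau_1, \tau_2, \tau_3\}$, for
instance, as in (39.12), and repeat for $i \geq 0$:

$$\begin{aligned}
\alpha_M(i) &= u(i) - u_{M,i-1}\, w^f_{M,i-1} \\
f_M(i) &= \alpha_M(i)\,\gamma_M(i-1) \\
c_{M+1,i} &= \begin{bmatrix} 0 \\ c_{M,i-1} \end{bmatrix} + \frac{\alpha^*_M(i)}{\lambda\,\zeta^f_M(i-1)} \begin{bmatrix} 1 \\ -w^f_{M,i-1} \end{bmatrix} \\
\nu_{M+1}(i) &= \text{last entry of } c_{M+1,i} \\
\delta_{M+1}(i) &= \text{top entry of } c_{M+1,i} \\
\gamma^{-1}_{M+1}(i) &= \gamma^{-1}_M(i-1) + \delta_{M+1}(i)\,\alpha_M(i) \\
\left[\zeta^f_M(i)\right]^{-1} &= \lambda^{-1}\left[\zeta^f_M(i-1)\right]^{-1} - \gamma_{M+1}(i)\,|\delta_{M+1}(i)|^2 \\
w^f_{M,i} &= w^f_{M,i-1} + f_M(i)\, c_{M,i-1} \\
\beta_M(i) &= \lambda\,\zeta^b_M(i-1)\,\nu^*_{M+1}(i) \\
\bar{\beta}_M(i) &= u(i-M) - u_{M,i}\, w^b_{M,i-1} \\
\beta^{(1)}_M(i) &= \tau_1\,\bar{\beta}_M(i) + (1-\tau_1)\,\beta_M(i) \\
\beta^{(2)}_M(i) &= \tau_2\,\bar{\beta}_M(i) + (1-\tau_2)\,\beta_M(i) \\
\beta^{(3)}_M(i) &= \tau_3\,\bar{\beta}_M(i) + (1-\tau_3)\,\beta_M(i) \\
\gamma^{-1}_M(i) &= \gamma^{-1}_{M+1}(i) - \beta^{(3)}_M(i)\,\nu_{M+1}(i) \\
b^{(1)}_M(i) &= \beta^{(1)}_M(i)\,\gamma_M(i) \\
b^{(2)}_M(i) &= \beta^{(2)}_M(i)\,\gamma_M(i) \\
\zeta^b_M(i) &= \lambda\,\zeta^b_M(i-1) + \beta^{(2)*}_M(i)\, b^{(2)}_M(i) \\
\begin{bmatrix} c_{M,i} \\ 0 \end{bmatrix} &= c_{M+1,i} - \nu_{M+1}(i) \begin{bmatrix} -w^b_{M,i-1} \\ 1 \end{bmatrix} \\
w^b_{M,i} &= w^b_{M,i-1} + b^{(1)}_M(i)\, c_{M,i} \\
\gamma_M(i) &= \lambda^M\,\zeta^b_M(i)/\zeta^f_M(i) \\
e(i) &= d(i) - u_{M,i}\, w_{i-1} \\
r(i) &= \gamma_M(i)\, e(i) \\
w_i &= w_{i-1} + r(i)\, c_{M,i}
\end{aligned}$$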


Incorporating Leakage 


Another approach to addressing the stability difficulties of FTF and FAEST is based on
inserting leakage correction factors into some of the FTF equations. The motivation for
using leakage is similar to what we discussed before for LMS in the concluding remarks of
Part IV (Mean-Square Performance) and in Prob. IV.39. The complexity of the resulting
algorithm becomes O(11M) operations per iteration; but it can be reduced to 8M by
applying leakage less often.

The procedure consists in first reducing the number of recursive loops that are sensitive
to numerical effects. In this regard, it computes $\gamma_M(i)$ directly from its definition
in (37.2), namely, $\gamma_M(i) = 1/(1 + u_{M,i}c_{M,i})$, rather than recursively as in FTF and
FAEST. Then a leakage factor $\epsilon$ is incorporated into the update equations for the quantities
$\{c_{M+1,i}, c_{M,i}, w^f_{M,i}, w^b_{M,i}\}$:

$$\begin{aligned}
c_{M+1,i} &= \epsilon \begin{bmatrix} 0 \\ c_{M,i-1} \end{bmatrix} + \frac{\alpha^*_M(i)}{\lambda\,\zeta^f_M(i-1)} \begin{bmatrix} 1 \\ -w^f_{M,i-1} \end{bmatrix} \\
w^f_{M,i} &= \epsilon\, w^f_{M,i-1} + \epsilon\, f_M(i)\, c_{M,i-1} \\
\begin{bmatrix} c_{M,i} \\ 0 \end{bmatrix} &= c_{M+1,i} - \epsilon\,\nu_{M+1}(i) \begin{bmatrix} -w^b_{M,i-1} \\ 1 \end{bmatrix} \\
w^b_{M,i} &= \epsilon\, w^b_{M,i-1} + \epsilon\, b_M(i)\, c_{M,i}
\end{aligned}$$

where $\epsilon$ is suitably chosen to be smaller than but close to one, e.g., $\epsilon = 0.98$ or
$\epsilon = 0.99$. The “stabilization” of the FTF recursions is achieved, however, at the expense of
degradation in performance since the leakage factor introduces bias in the weight vector
estimates, especially during the initial stages of adaptation.
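
The direct evaluation of the conversion factor can be sketched in Python as follows. The sketch assumes real data and uses the relation $c_{M,i} = \lambda^{-1}P_{i-1}u^*_{M,i}$ between the normalized gain and the RLS matrix only to generate a test value; the function names are hypothetical, and `leaky_update` shows just the generic form of a leaky correction, not a specific line of the algorithm.

```python
import numpy as np

def normalized_gain(P_prev, u, lam):
    """c_{M,i} = lam^{-1} P_{i-1} u^T for a real regressor u (1-D array)."""
    return P_prev @ u / lam

def conversion_factor(u, c):
    """gamma_M(i) = 1 / (1 + u_{M,i} c_{M,i}), evaluated directly."""
    return 1.0 / (1.0 + u @ c)

def leaky_update(w_prev, step, c, eps=0.99):
    """Generic leaky correction: eps * w_prev + eps * step * c."""
    return eps * w_prev + eps * step * c
```

Because the conversion factor is recomputed from the current regressor and gain at every step, errors in it do not accumulate through a recursion of its own, which is the motivation given above for this choice.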


Summary and Notes 


The chapters in this part describe several efficient implementations of RLS. 


SUMMARY OF MAIN RESULTS 


1. By exploiting the fact that regressors have shift structure, the computational cost of RLS can
   be reduced from $O(M^2)$ to $O(M)$ operations per iteration.

2. The key insight is to realize that the gain vector $g_i$, and the conversion factor $\gamma(i)$, can be
   time-updated without requiring explicit evaluation of the matrix $P_i$. Instead, only the
   low-rank difference

   $$\begin{bmatrix} P_{i-1} & 0 \\ 0 & 0 \end{bmatrix} - \begin{bmatrix} 0 & 0 \\ 0 & P_{i-2} \end{bmatrix}$$

   needs to be time-updated.

3. Interestingly enough, the regularization matrix $\Pi$ can be chosen so as to enforce a rank-two
   difference right at the first iteration. Once this is done, the rank-two property preserves itself
   over time.

4. Four efficient implementations were described in this part: (a) fast array filter, (b) fast
   transversal filter (FTF), (c) fast a posteriori error sequential technique (FAEST), and (d) fast
   Kalman filter. The array form is the simplest to derive and to describe; it also shows reasonable
   robustness in finite-precision implementation, as illustrated in the computer project at the end of
   this part. The FTF and FAEST filters are the most efficient but tend to suffer from numerical
   difficulties. This is because many of the equalities used to derive them tend to break down in
   finite precision. The fast Kalman filter is the oldest among them.

5. The derivation of FTF, FAEST, and fast Kalman is based on studying forward and backward
   prediction problems, and on showing that the update of the gain vector can be described in
   terms of the weight vectors for the forward and backward prediction problems.

6. The derivation in the chapters assumes that regularization is present throughout.

7. Several stabilization procedures are described including: (a) rescue mechanisms, (b) feedback
   stabilization, and (c) leakage.

8. In Prob. IX.13 we show that fast RLS filters can be derived even for cases involving regressors
   without shift structure; another important example to this effect is discussed in Chapter 16 of
   Sayed (2003) in the context of RLS Laguerre filtering.
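
The rank-two property noted in the summary above can be checked numerically with the following Python sketch. It is an illustration under stated assumptions: real, prewindowed data with shift structure, and the regularization $P_{-1} = \eta\,\mathrm{diag}\{\lambda^2, \ldots, \lambda^{M+1}\}$. The function name `displacement_eigs` is hypothetical.

```python
import numpy as np

def displacement_eigs(u, M, lam, eta):
    """Eigenvalues of diag(P_i, 0) - diag(0, P_{i-1}) for i = 0, 1, ...

    P_i is propagated by the conventional RLS Riccati recursion with
    shift-structured regressors u_i = [u(i), u(i-1), ..., u(i-M+1)].
    """
    P_prev = eta * np.diag(lam ** np.arange(2, M + 2))   # assumed P_{-1}
    reg = np.zeros(M)
    out = []
    for x in u:
        reg = np.concatenate(([x], reg[:-1]))
        Pu = P_prev @ reg
        gamma = 1.0 / (1.0 + reg @ Pu / lam)
        P = (P_prev - gamma * np.outer(Pu, Pu) / lam) / lam
        D = np.zeros((M + 1, M + 1))
        D[:M, :M] += P            # P_i in the leading block
        D[1:, 1:] -= P_prev       # P_{i-1} in the trailing block
        out.append(np.linalg.eigvalsh(D))
        P_prev = P
    return out
```

For each time instant, only two eigenvalues should be significantly different from zero, one positive and one negative, which matches a signature matrix of the form $\mathrm{diag}\{1, -1\}$.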


BIBLIOGRAPHIC NOTES 


Regularization. In comparison to conventional derivations of fast fixed-order filters, we have
incorporated regularization into our arguments right from the early stages of the derivation.


Conversion factor and original derivations. The variable $\gamma(i)$ plays the role of a conversion
factor, which converts an a priori error into an a posteriori error. This is a useful property since it
allows us to evaluate an a posteriori error before updating the relevant tap-weight vector itself. This
property was exploited to great advantage in our presentation, as well as in the original derivations of 
the fast a posteriori error sequential technique (FAEST) by Carayannis, Manolakis, and Kalouptsidis 
(1983) and the fast transversal filter (FTF) by Cioffi and Kailath (1984). Further related works on 
efficient implementations of the least-squares type appear in Moustakides and Theodoridis (1991) 
and Glentis, Berberidis, and Theodoridis (1999). 


Fast Kalman filter. The fast Kalman filter of Sec. 39.3 is the earliest fast RLS filter. It was 
developed by Ljung, Morf, and Falconer (1978) and was also used in Falconer and Ljung (1978). 


Fast array filter. The derivation that led to the array form of Alg. 37.1 is a streamlined version of a 
derivation given by Sayed and Kailath (1992,1994a,1998). This array form is actually a special case 
of a more general algorithm known as the extended Chandrasekhar algorithm (37.26), which
was derived by Sayed and Kailath (1994a) as a fast state-space estimation method for models whose
parameters vary in a structured manner. This connection with state-space estimation was explained in 
Sayed and Kailath (1994b) and is also presented in App. 37.A and Prob. IX.16 (see also Prob. IX.13). 


Backward consistency. As explained in Chapter 39, fast fixed-order least-squares filters are 
sensitive to round-off errors and tend to suffer from instability problems in finite word-length imple- 
mentations (see, e.g., Botto (1988) as well as the computer project at the end of this part). Using the 
concept of backward consistency, Regalia (1992,1993) and Slock (1992) provided an explanation for 
the origin of such difficulties. Their analyses indicated that the dynamics of a fast transversal filter 
(FTF) contains unstable modes that are not excited in exact arithmetic but that become active in finite 
word-length implementations. Appendix 14.C of Sayed (2003) provides a summary of their results; 
see also Bunch, Leborne, and Proudler (2001) for a reference dealing with the issue of consistency 
for adaptive signal processing. It was to address such numerical problems that Lin (1984), Cioffi 
and Kailath (1984), and Eleftheriou and Falconer (1987) suggested using rescue variables in order 
to monitor the behavior of the FTF algorithm and to re-initialize it whenever abnormal behavior is
observed (as was discussed in Sec. 39.4).


Rescue schemes. Most rescue mechanisms monitor the value of the conversion factor to detect 
filter divergence. This is because it is known beforehand that the value of the conversion factor should 
lie inside the interval (0, 1]. Motivated by this fact, Lin (1984) suggested monitoring the rescue 
variable (39.9). Cioffi and Kailath (1984) employed a similar rescue procedure, in addition to adding 
a small white noise to the input data (a technique known as dithering). The purpose of dithering was
to prevent the matrix $P_i$ from becoming close-to-singular in steady-state. This method introduces a
small bias in the weight estimate and is only effective if $\lambda$ is chosen as $\lambda > (5M - 1)/(5M + 1)$,
where $M$ is the filter order.

Another technique for improving the reliability of fast least-squares filters is the one based on 
introducing feedback into the operation of the filters in order to influence the propagation of the 
errors. This technique resulted in the stabilized FTF variant of Alg. 39.4 by Slock and Kailath 
(1988,1991). The technique was also applied to the fast Kalman filter by Botto and Moustakides 
(1989). Still, even with "stabilization", the stabilized FTF algorithm can suffer from numerical
difficulties, especially when the forgetting factor is not sufficiently close to unity. In Slock and Kailath
(1991), it was suggested that $\lambda$ should be chosen within the range $(2M - 1)/2M < \lambda < 1$. Yet
another rescue technique is the one based on inserting leakage into some of the FTF recursions, as 
described in Sec. 39.4. This technique was proposed by Binde (1995). Apparently, the performance 
of the method does not depend on the input signal statistics. However, the "stabilization" of the FTF 
recursions is achieved at the expense of degradation in performance due to bias, especially during 
the initial adaptation phase. The overall complexity of this procedure is O(11M) operations per 
iteration; nevertheless, the complexity can be reduced to 8M, by applying leakage less often. 


Fast filtering methods. The fast algorithms of this part, array-based and otherwise, have
connections with a fast alternative to the Kalman filter for constant state-space models known as the
Chandrasekhar filter. The filter is described in App. 37.A and it was originally developed in the early 1970s
by Kailath, Morf, and Sidhu (1973) and Morf, Sidhu, and Kailath (1974), as an extension to discrete-
time of results derived in continuous time by Kailath (1972,1973). Interestingly, the continuous-time 
results were motivated by earlier works by Ambartsumian (1943) and Chandrasekhar (1947a,1947b) 
on radiative transfer theory. Lindquist (1974,1976) also independently obtained a fast algorithm for 
discrete-time filtering; albeit one that is specific to stationary processes. The array version of the 
Chandrasekhar filter appeared in Morf and Kailath (1975) and Kailath, Vieira, and Morf (1978). 


Extended Chandrasekhar filter. Connections between the original Chandrasekhar filter (37.24) 
for state-space estimation and fast RLS algorithms have been discussed by Houacine and Demo- 
ment (1986), Slock (1989), and Houacine (1991) by formulating time-invariant state-space models. 
However, the state-space model that arises in the context of adaptive filtering is not constant (as we 
saw, for example, in model (31.8)); this is because the regressor $u_i$ varies with time. Motivated by
this observation, Sayed and Kailath (1992) showed how to extend the original Chandrasekhar filter 
to a class of structured state-space models (cf. (37.25)), a special case of which is model (31.8). 
They derived the extended Chandrasekhar filter (37.26), which is an efficient estimation procedure 
for state-space models that vary in a certain structured manner. In this way, the relation between the 
Chandrasekhar filter and fast RLS methods becomes more natural and also broader (see Probs. IX.13 
and IX.16). In addition, the extended Chandrasekhar filter can be used to derive fast RLS methods 
even for some non-tapped-delay-line structures (as was shown by Merched and Sayed (1999) and as 
is discussed in detail in Chapter 16 of Sayed (2003)).


PROBLEMS


Problem IX.1 (Initial conditions) Refer to the discussion in Sec. 37.3 on initial conditions and 
let II be as in (37.13). 


(a) Verify that in this case 37 = mA diag [rg 00,7]. 
(b) Conclude that 
1 0 
0 0 
T n ; A 1 0 
Lo = , | —. : ; " So = 
ST YTE MOP | S| ° E A 
0 0 
0 AMA 


along with wo = w + go[d(0) — uow] and 


y "(0 -4.1-mAWw(OP, go 


nA x 
EEES a eol 20 hang 
TF au? col(u" (0), 0 0) 


Problem IX.2 (Order- and time-update of conversion factor) Start from expression (38.22)
for $\gamma_{M+1}(i)$ in terms of $\gamma_M(i-1)$ and rewrite it as

$$\gamma_{M+1}(i) = \gamma_M(i-1)\left[1 - \frac{|f_M(i)|^2}{\gamma_M(i-1)\,\zeta^f_M(i)}\right]$$

(a) Since $\zeta^f_M(i)$ denotes the minimum cost of the regularized least-squares problem (38.19), it
satisfies $\zeta^f_M(i) = \lambda\zeta^f_M(i-1) + \alpha_M(i)f^*_M(i)$ (cf. Alg. 30.2). Use this relation to show that
the above expression for $\gamma_{M+1}(i)$ leads to the equality

$$\gamma_{M+1}(i)/\gamma_M(i-1) = \lambda\zeta^f_M(i-1)/\zeta^f_M(i)$$

(b) Likewise, show that

$$\gamma_{M+1}(i)/\gamma_M(i) = \lambda\zeta^b_M(i-1)/\zeta^b_M(i)$$

(c) Use part (b) to show that $1 - |\beta_M(i)|^2\gamma_{M+1}(i)/\lambda\zeta^b_M(i-1) = 1 - b_M(i)\beta^*_M(i)/\zeta^b_M(i)$,
and conclude from (38.12), and from the definition of $\nu_{M+1}(i)$ in (39.6), namely,
$\nu_{M+1}(i) = \beta^*_M(i)/\lambda\zeta^b_M(i-1)$, that

$$\gamma_M(i) = \frac{\gamma_{M+1}(i)}{1 - \beta_M(i)\,\gamma_{M+1}(i)\,\nu_{M+1}(i)}$$


Problem IX.3 (Relation between conversion factors) Use the result of Prob. IX.2 to show
that

$$\gamma_{M+1}(i)/\gamma_M(i-1) \leq 1 \quad \text{and} \quad \gamma_{M+1}(i)/\gamma_M(i) \leq 1$$


Problem IX.4 (Time-update for conversion factor) Use again the result of Prob. IX.2 to
derive the time-update relation:

$$\gamma_M(i) = \gamma_M(i-1)\,\frac{\zeta^b_M(i)\,\zeta^f_M(i-1)}{\zeta^b_M(i-1)\,\zeta^f_M(i)}$$

Conclude, by iterating this result, that $\gamma_M(i) = \lambda^M\zeta^b_M(i)/\zeta^f_M(i)$.


Problem IX.5 (Updating the inverse cost) Start from $\zeta^f_M(i) = \lambda\zeta^f_M(i-1) + \alpha^*_M(i)f_M(i)$,
apply the matrix inversion lemma (30.11) to the right-hand side, and use (39.7) to conclude that

$$1/\zeta^f_M(i) = \lambda^{-1}/\zeta^f_M(i-1) - \gamma_{M+1}(i)\,|\delta_{M+1}(i)|^2$$

where $\delta_{M+1}(i)$ is defined by (39.14).


Problem IX.6 (Order- and time-update of gain vector) In this problem we want to establish 
relation (39.4). 


(a) Use the result of part (a) of Prob. IX.2, and (38.23), to show that 


0 AW avy (i) 1 
i= + jc. ADM 
Tm | Mi vent erreur FEGI) | -whi 


Conclude that 


act 0 Ene AT aili) 1 
ni - a S 
ai gm,i-1 | m+ (i) n7Ai-2 + El (i — 1) -whi 


(b) Use (39.2), namely, $w^f_{M,i} = w^f_{M,i-1} + g_{M,i-1}\gamma^{-1}_M(i-1)f_M(i)$, and substitute $w^f_{M,i}$ into
the expression of part (a) in terms of $w^f_{M,i-1}$ in order to establish (39.4).

Problem IX.7 (Triangular factorization) Introduce the notation 
Cha) È q7A7 +), ko01..,M-1 


Iterate relation (38.9) to conclude that the matrix $P_{M,i}$ admits the upper-diagonal-lower triangular
factorization $P_{M,i} = U_{M,i}D_{M,i}U^*_{M,i}$, where

$$D_{M,i} = \mathrm{diag}\left\{1/\zeta^b_0(i),\ 1/\zeta^b_1(i),\ \ldots,\ 1/\zeta^b_{M-1}(i)\right\}$$

and where the columns of $U_{M,i}$ are defined in terms of the backward prediction vectors

$$\begin{bmatrix} -w^b_{k,i} \\ 1 \end{bmatrix}, \quad k = 0, 1, \ldots, M-1$$


Problem IX.8 (Determinant expression) Show that

$$\det P_{M,i} = \prod_{k=0}^{M-1} \frac{1}{\zeta^b_k(i)}$$

Conclude from part (b) of Prob. IX.2 that

$$\gamma_M(i) = \lambda^M\,\frac{\det P_{M,i}}{\det P_{M,i-1}}$$


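Problem IX.9 (Initial iterations of FTF) Refer to the FTF recursions in Alg. 39.1. Verify that
the variable $\nu_{M+1}(i)$, and consequently the errors $\{\beta_M(i), b_M(i)\}$, are zero for $i = 0$ to $M - 1$.


(a) Conclude that during these initial time instants the FTF recursions are given by:

for $i = 0, 1, \ldots, M-1$:

$$\begin{aligned}
\alpha_M(i) &= u(i) - u_{M,i-1}\, w^f_{M,i-1} \\
f_M(i) &= \gamma_M(i-1)\,\alpha_M(i) \\
\zeta^f_M(i) &= \lambda\,\zeta^f_M(i-1) + \alpha^*_M(i)\, f_M(i) \\
\gamma_M(i) &= \gamma_M(i-1)\,\lambda\zeta^f_M(i-1)/\zeta^f_M(i) \\
w^f_{M,i} &= w^f_{M,i-1} + f_M(i)\, c_{M,i-1} \\
\begin{bmatrix} c_{M,i} \\ 0 \end{bmatrix} &= \begin{bmatrix} 0 \\ c_{M,i-1} \end{bmatrix} + \frac{\alpha^*_M(i)}{\lambda\,\zeta^f_M(i-1)} \begin{bmatrix} 1 \\ -w^f_{M,i-1} \end{bmatrix} \\
\zeta^b_M(i) &= \lambda\,\zeta^b_M(i-1) \\
e(i) &= d(i) - u_{M,i}\, w_{i-1} \\
r(i) &= \gamma_M(i)\, e(i) \\
w_i &= w_{i-1} + r(i)\, c_{M,i}
\end{aligned}$$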


(b) Use (38.1) to argue that wm -1 is the solution to the least-squares problem: 
min [war + (ym—1 — Hy-iw)' Au-i(yw-i — Hu-1w)| 
where Hm-1 is lower triangular Toeplitz with first column (u(0), u(1),...,u(M — 1)}. 


Problem IX.10 (Rank-three factorization) Use the recursion for P; from (37.2) and (38.25) to 
show that 


Problem IX.11 (Generating functions) Introduce the basis function 
Bule) È [1 eta any uus aD AM ] 


and use it to transform the matrix and vector quantities (P, whi wh is gi} into functions of z and 
s as follows: 


P(zs) = Bu(z)BBu(s,  9(z) = 2 v ^" (i)Bu(2)g: 
À 1 À -whi 
wi(z) = qoo | Aa | , w(z)= cgo | P l 


(a) Show that the equality of Prob. IX.10 reduces to the function equality: 


(b) Choose s = z and conclude that because P; is positive-definite, the following inequalities 


should hold: 
>0, jz} >1 
w (z)w!*(z) + 9(z)g*(z) - w'(z)w" (2) 4 20, |z] 21 
«0, |z| «1 


Conclude that w? (z) cannot have zeros inside the unit circle. 

(c) Verify that w°(z)w*(1/z*) = wi (z)w/* (1/2*) + g(z)g' (1/z*). 
Remark. This equation can be used to argue that w°(z) is uniquely defined by {wf (z), g(z)) (more 
specifically, it is the spectral factor of the right-hand-side expression). Moreover, the quantity e (N) 
can be inferred from the last coefficient of w? (z) — see Regalia (1992,1993). 


Problem IX.12 (Leading and trailing columns of $P_{M+1,i}$) Show that the last column of $P_{M+1,i}$
is proportional to the backward prediction vector $\mathrm{col}\{-w^b_{M,i}, 1\}$. Show further that the first column
of $P_{M+1,i}$ is proportional to the forward prediction vector $\mathrm{col}\{1, -w^f_{M,i}\}$.


Problem IX.13 (General fast array algorithm) Consider data $\{u_j, d(j)\}_{j=0}^{N}$, where the $u_j$ are
$1 \times M$ and the $d(j)$ are scalars. Consider also an $M \times 1$ vector $\bar{w}$, a scalar $0 \ll \lambda \leq 1$, and an $M \times M$
positive-definite matrix $\Pi$. Assume further that for a particular choice of $\Pi$ there exists some $M \times M$
matrix $\Psi$ such that the difference $P_{-1} - \Psi P_{-2}\Psi^*$ is low rank, say $P_{-1} - \Psi P_{-2}\Psi^* = L_{-1}SL^*_{-1}$,
for some $L_{-1}$ and signature matrix $S$. It is further assumed that the regressors satisfy $u_j\Psi = u_{j-1}$.
Follow the discussion in Chapter 37 to show that the solution $w_N$ of the least-squares problem

$$\min_w \left[\lambda^{N+1}(w - \bar{w})^*\Pi(w - \bar{w}) + \sum_{j=0}^{N} \lambda^{N-j}\,|d(j) - u_jw|^2\right]$$

can be recursively computed as follows. Start with $w_{-1} = \bar{w}$, $\gamma^{-1/2}(-1) = 1$, $g_{-1} = 0$, $L_{-1}$ from
the above low-rank factorization, and repeat for $i \geq 0$:

$$\begin{bmatrix} \gamma^{-1/2}(i-1) & \lambda^{-1/2}u_iL_{i-1} \\ \Psi g_{i-1}\gamma^{-1/2}(i-1) & \lambda^{-1/2}L_{i-1} \end{bmatrix} \Theta_i = \begin{bmatrix} \gamma^{-1/2}(i) & 0 \\ g_i\gamma^{-1/2}(i) & L_i \end{bmatrix}$$

where $\Theta_i$ is a $(1 \oplus S)$-unitary matrix that produces the zero entries in the post-array, and where the
factor $L_i$ satisfies $P_i - \Psi P_{i-1}\Psi^* = L_iSL^*_i$. Remark. These recursions are a special case of the extended
Chandrasekhar filter of Sayed and Kailath (1994a,1994b) for structured state-space models; see App. 37.A.
They form the basis for the derivation of fast RLS Laguerre adaptive filters; see Sec. 16.3 of Sayed (2003).


Problem IX.14 (Array algorithm for sliding window RLS) Refer to the finite-memory RLS
algorithm derived in Prob. VII.34. Follow the arguments in Sec. 35.1 to motivate the following array 
algorithm, Let © = II^! and introduce the Cholesky decomposition X = X1/?Y;*/?, where X/? is 
lower triangular with positive-diagonal entries. Then start with wł; = ù, (PL)? = XV?, and 
repeat for i > 0. 
1. (downdating step). Find a (1 © —I)—unitary matrix O¢ that lower triangularizes the pre- 
array shown below and generates a post-array with positive diagonal entries. Then the entries 
in the post-array will correspond to 


1 wr (P: aue. ar ed) S 
1/2 i7 - ; - 
0 (Pi) -gk i =1) (P) 


2. (updating step). Find a unitary matrix O7 that lower triangularizes the pre-array shown 
below and generates a post-array with positive diagonal entries. Then the entries in the post- 
array will correspond to 


l uw (PZT) pi e E aL) * 1/2 
gra (PE) 

3. Evaluate the errors eq(i — L) = d(i — L) — ui- zwi, and eu (i) = d(i) — uwa.

4. Update the weight vector from w; to wP as follows:


& 
E 
I 


E? = wha (s^ 6-0) (az76-2) et- 0 


- i! - » -1 d 
wt = witht (af PO) (Q9) «0 


Problem IX.15 (Array Chandrasekhar form) Refer to the discussion on the Chandrasekhar fil- 
ter in App. 37.A. Compare entries on both sides of (37.23) and identify the terms {X, Y, Z} as in 
(37.24). Likewise, verify the validity of the extended Chandrasekhar recursion (37.26). 


Problem IX.16 (Fast array RLS and Chandrasekhar filter) In Sec. 31.2 we showed that the 
exponentially-weighted RLS algorithm can be obtained via equivalence to the Kalman filter as fol- 
lows. We start from the state-space model (31.8) and write down the Kalman recursions for estimat- 
ing its state variable. Then we translate the Kalman variables into the RLS variables by using the 
correspondences summarized in Table 31.2. We used this procedure in Probs. VIII.12 and VIII.13 in 
order to examine the relation between RLS array methods and Kalman filtering array methods. 

In this problem, we use the same reasoning to explain the connection between the array Chan- 
drasekhar filter of App. 37.A and the fast array RLS method of Alg. 37.1 when u; has shift structure, 
i.e., when ui = [ vi) u(i—1) ... uGi-M+1) k To do so, we need to replace the time- 
variant state-space model (31.8) by a structured state-space model of the form defined by (37.25). 

Thus consider the least-squares formulation (37.1), with N + 1 data points, and assume the 
regressors u; have shift structure. Introduce the following (N + 1)—dimensional state-space model 
(in contrast, model (31.8) is M —dimensional): z;41 = A~1/?ar, and y(i) = hia; + v(i), with hi 
defined as the (N + 1)-long vector h; = [u(i) u(i — 1) ... u(0) O01xN-i-1]. That is, ^; has all 
the input data from time 0 up to and including time 7. The remaining entries are zeros. Moreover, 
the trailing N — M entries of the state vectors v; are taken to be zero, i.e., v; = col( x Ou xi), 
so that his; = uix; for all i. Let 


IIo 0 


Ev()v'(j)-ój, Evora = 
O Ow-M)x(N—M) 
(a) Verify that h; satisfies hi = hi41Z, where Z is the lower triangular shift matrix with ones on 
the first subdiagonal and zeros elsewhere. Conclude that the state-space model defined above 
is structured (i.e., it satisfies (37.25) with V; = Z). 


(b) From the Kalman recursions (37.22) for the above model, i.e., from 
Sui = A iia + ky [y li) — hiĉii-1] 
ret) = 1+ hiPy-ihi 
kpi = NP ht /re(i) 
Pay = A [Bua — Pia hi hiPyi-a/re()] , Poj-1 = (Io © OW-M) x(w-™)) 


verify that the normalized gain vector kp arl KO) has trailing zeros. Specifically, argue that it 
has the form 


kpr”? (i) = | e | 


ON-m)x1 
for some M x 1 vector ĉ;. 
(c) Verify that the extended Chandrasekhar recursions (37.26) that correspond to the above model 


are given by 
r"(- 1) hiL i-a ra? (i) [ 0 0 | 


Gj Lajas 0; = a. - 
Z Hed AVE, ica : e Lii-i 
Otv-M)xi ON -M)yxi 
where $\Theta_i$ is any $(1 \oplus S)$-unitary matrix that lower triangularizes the pre-array. Moreover,
$(L_{0|-1}, S)$ are found from the factorization $P_{1|0} - ZP_{0|-1}Z^* = L_{0|-1}SL^*_{0|-1}$.


(d) Argue that Z,);-1 also has trailing zeros and write it as 
Lij- = Lai 
OtN-M-1xe 
where La is (M + 1) x a. Let Ri denote the row vector with the first M + 1 entries of 


hi. Verify that the Chandrasekhar recursions of part (c) reduce to 


ri; -1)  hL Ww 
| 0 | te PRU ©, = | či | Pus 
Ci-i 0 

(e) Use the correspondences from Table 11.1 between RLS and Kalman variables to show that 

the above Chandrasekhar filter leads to the fast array method of Alg. 37.1, namely, 
347i? (i— 1) [ u(i) wii | že- 7M) [ 0 0 | 

7 © = | giY- 1? (i) | ASL 
0 


0 
Li-i 
| gi-vy V"? (i — 1) x 


Problem IX.17 (Fast multichannel RLS) Consider $N$ FIR channels with $\{M_k, k = 1, \ldots, N\}$
taps each. Let $\{u^{(k)}(i), k = 1, \ldots, N\}$ denote the input sequences to these channels with the
corresponding regression vectors and tap vectors denoted by

$$u^{(k)}_i \triangleq \begin{bmatrix} u^{(k)}(i) & u^{(k)}(i-1) & \ldots & u^{(k)}(i - M_k + 1) \end{bmatrix}, \quad k = 1, \ldots, N$$

and $\{w^{(k)}, k = 1, \ldots, N\}$. The outputs of the channels are combined together to yield the noisy
measurement $d(i) = \sum_{k=1}^{N} u^{(k)}_i w^{(k)} + v(i)$; see Fig. IX.1.


FIGURE IX.1 Data transmission over N FIR channels.

This formulation can be recast in terms of a single-channel description by collecting all regressors
and tap vectors into a single extended regressor and a single extended tap vector:

$$u_i \triangleq \begin{bmatrix} u^{(1)}_i & u^{(2)}_i & \ldots & u^{(N)}_i \end{bmatrix}, \quad w^o \triangleq \mathrm{col}\{w^{(1)}, w^{(2)}, \ldots, w^{(N)}\}$$

so that $d(i) = u_iw^o + v(i)$. Therefore, given measurements $\{d(i), u_i\}$ we can estimate the weight
vector $w^o$ and, subsequently, the individual channel weight vectors $\{w^{(k)}\}$, by applying an adaptive
filter (of the LMS or RLS type) to the data $\{d(i), u_i\}$, e.g., by using


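$$w_i = w_{i-1} + \mu u^*_i\left[d(i) - u_iw_{i-1}\right] \quad \text{(multichannel LMS)}$$

or

$$\begin{aligned}
\gamma(i) &= 1/(1 + \lambda^{-1}u_iP_{i-1}u^*_i) \\
g_i &= \lambda^{-1}\gamma(i)P_{i-1}u^*_i \\
w_i &= w_{i-1} + g_i\left[d(i) - u_iw_{i-1}\right] \quad \text{(multichannel RLS)} \\
P_i &= \lambda^{-1}P_{i-1} - g_ig^*_i/\gamma(i)
\end{aligned}$$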


The computational cost of the LMS implementation will be O( M) operations per iteration, while 
that of RLS will be O(M?) operations per iteration, where M = (Mi + M2 +... + My). 
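The single-channel recasting can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the data are real-valued and noise-free, and the function name `multichannel_lms`, the step size, and the example channels are hypothetical choices made for this illustration rather than part of the problem statement. Each channel keeps its own tapped-delay line; the lines are concatenated into the extended regressor before the LMS update is applied.

```python
import random

def multichannel_lms(channels_true, mu=0.05, n_iter=3000, seed=0):
    """Estimate the taps of several FIR channels from their combined output.

    channels_true: list of tap lists, one per channel (hypothetical test data).
    Returns the estimated taps, split back into one list per channel.
    """
    rng = random.Random(seed)
    taps = [len(w) for w in channels_true]
    w_true = [c for w in channels_true for c in w]     # extended tap vector
    lines = [[0.0] * m for m in taps]                  # one delay line per channel
    w = [0.0] * len(w_true)                            # initial weight estimate
    for _ in range(n_iter):
        for line in lines:                             # shift a new input sample into each channel
            line.insert(0, rng.gauss(0.0, 1.0))
            line.pop()
        u = [x for line in lines for x in line]        # extended regressor
        d = sum(a * b for a, b in zip(u, w_true))      # combined (noise-free) measurement
        e = d - sum(a * b for a, b in zip(u, w))       # output estimation error
        w = [wk + mu * uk * e for wk, uk in zip(w, u)] # LMS update on the stacked data
    out, pos = [], 0
    for m in taps:                                     # recover the per-channel estimates
        out.append(w[pos:pos + m])
        pos += m
    return out
```

Calling `multichannel_lms([[1.0, -0.5], [0.3, 0.2, -0.1]])` returns two lists, one per channel. Each update touches every entry of the stacked regressor once, which is where the $O(M)$ cost per iteration comes from.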

In this problem we are interested in showing that in the case of RLS, an efficient $O(M)$ implementation can be pursued by observing the following. Although, strictly speaking, the regressor $u_i$ is not shift-structured, it still consists of individual vectors $\{u_i^{(k)}\}$ that are shift-structured themselves. This fact can be exploited to derive fast RLS implementations.


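(a) Define

$$[\Pi^{(k)}]^{-1} \triangleq \eta \cdot \text{diag}\{\lambda^2, \lambda^3, \ldots, \lambda^{M_k+1}\} \qquad (M_k \times M_k)$$

and choose the regularization matrix $\Pi$ (i.e., $P_{-1}^{-1}$) as $\Pi = \text{diag}\{\Pi^{(1)}, \Pi^{(2)}, \ldots, \Pi^{(N)}\}$. Define further the $(M_k + 1) \times 2$ and $2 \times 2$ matrices

$$\bar L_{-1}^{(k)} = \sqrt{\eta\lambda} \begin{bmatrix} 1 & 0 \\ 0 & 0 \\ \vdots & \vdots \\ 0 & 0 \\ 0 & \lambda^{M_k/2} \end{bmatrix}, \qquad S^{(k)} = \begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix}$$

as well as the $(M + N) \times 2N$ matrix $\bar L_{-1} = \text{diag}\{\bar L_{-1}^{(1)}, \bar L_{-1}^{(2)}, \ldots, \bar L_{-1}^{(N)}\}$, and the $2N \times 2N$ signature matrix $S = \text{diag}\{S^{(1)}, S^{(2)}, \ldots, S^{(N)}\}$. Partition $g_i$ into

$$g_i = \text{col}\{g_i^{(1)}, g_i^{(2)}, \ldots, g_i^{(N)}\} \qquad (M \times 1)$$

where each $g_i^{(k)}$ is $M_k \times 1$. Show, by either repeating the derivation of the fast array method of Chapter 37 or by appealing to equivalence with Kalman filtering as in Prob. IX.16, that the $\{g_i^{(k)}\}$ can be propagated iteratively as follows: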


471 - 1) [uM a. ul) uM E 


0 
gy P -1) Ə; 
: diag {L,, 109... 10) 
0 
gren 


7/204) Oix2N 
gyi) 
= 0 
VX diag {L, L,..., 20} 
gy M (a) 


0 


where each $\bar L_i^{(k)}$ is $(M_k + 1) \times 2N$ and $\Theta_i$ is any $(1 \oplus S)$-unitary transformation that produces the zeros in the top row of the post-array.


(b) The computational cost of the array method of part (a) is $O(2NM)$ operations per iteration. It can be reduced to $O(2M)$ operations per iteration when all channels have the same number of taps, say $M_k = K$. Consider again the $N \times N$ regularization matrix $\Pi$ and let $[\Pi^{(k,l)}]$ denote its $(k,l)$ block; each such block is $K \times K$. Rather than select $\Pi$ as a block diagonal matrix, as was the case in part (a), we now choose its diagonal and off-diagonal blocks as follows:

$$[\Pi^{(k,l)}]^{-1} = \eta \cdot \text{diag}\{\lambda^2, \lambda^3, \ldots, \lambda^{K+1}\}, \qquad (K \times K)$$

Show that the same array algorithm of part (a) will still hold with $\bar L_{-1}$ now defined as the $N(K+1) \times 2$ matrix

$$\bar L_{-1} = \begin{bmatrix} \bar L_{-1}^{(1)} \\ \bar L_{-1}^{(2)} \\ \vdots \\ \bar L_{-1}^{(N)} \end{bmatrix}$$

and $S = (1 \oplus -1)$. Moreover, the $0_{1 \times 2N}$ block in the post-array should be replaced by $0_{1 \times 2}$ and each $\bar L_i^{(k)}$ is now $(K+1) \times 2$.


Remark. For a discussion on multichannel least-squares problems and equivalence with Kalman filtering, see Khalaj, Sayed, and Kailath (1993).


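Problem IX.18 (Adaptive Volterra filtering) A second-order Volterra filter is described by a nonlinear mapping of the form

$$d(i) = \sum_{j=0}^{M-1} w^o(j)u(i-j) + \sum_{j=0}^{M-1}\sum_{k=j}^{M-1} w^o(j,k)u(i-j)u(i-k) + v(i)$$

where $\{u(i)\}$ is the input sequence, $\{v(i)\}$ is the noise sequence, and $\{w^o(j), w^o(j,k)\}$ denote the filter coefficients. Compared with a standard FIR mapping, we find that products of the form $\{u(i-j)u(i-k)\}$ also appear. We can reformulate this problem as a multichannel filtering problem by defining $M + 1$ channels with the following regression vectors:

$$\begin{aligned}
u_i^{(1)} &\triangleq \begin{bmatrix} u(i) & u(i-1) & \ldots & u(i-M+1) \end{bmatrix} && \text{(channel 1; } M \text{ taps)} \\
u_i^{(2)} &\triangleq \begin{bmatrix} u^2(i) & u^2(i-1) & \ldots & u^2(i-M+1) \end{bmatrix} && \text{(channel 2; } M \text{ taps)} \\
u_i^{(3)} &\triangleq \begin{bmatrix} u(i)u(i-1) & \ldots & u(i-M+2)u(i-M+1) \end{bmatrix} && \text{(channel 3; } M-1 \text{ taps)} \\
u_i^{(4)} &\triangleq \begin{bmatrix} u(i)u(i-2) & \ldots & u(i-M+3)u(i-M+1) \end{bmatrix} && \text{(channel 4; } M-2 \text{ taps)} \\
&\ \ \vdots \\
u_i^{(M+1)} &\triangleq \begin{bmatrix} u(i)u(i-M+1) \end{bmatrix} && \text{(channel } M+1\text{; 1 tap)}
\end{aligned}$$

with the corresponding tap vectors:

$$\begin{aligned}
w^{o(1)} &\triangleq \begin{bmatrix} w^o(0) & \ldots & w^o(M-1) \end{bmatrix} && \text{(channel 1; } M \text{ taps)} \\
w^{o(2)} &\triangleq \begin{bmatrix} w^o(0,0) & \ldots & w^o(M-1,M-1) \end{bmatrix} && \text{(channel 2; } M \text{ taps)} \\
w^{o(3)} &\triangleq \begin{bmatrix} w^o(0,1) & \ldots & w^o(M-2,M-1) \end{bmatrix} && \text{(channel 3; } M-1 \text{ taps)} \\
w^{o(4)} &\triangleq \begin{bmatrix} w^o(0,2) & \ldots & w^o(M-3,M-1) \end{bmatrix} && \text{(channel 4; } M-2 \text{ taps)} \\
&\ \ \vdots \\
w^{o(M+1)} &\triangleq \begin{bmatrix} w^o(0,M-1) \end{bmatrix} && \text{(channel } M+1\text{; 1 tap)}
\end{aligned}$$

Again, this formulation can be recast in terms of a single-channel description by collecting all regressors and taps into a single extended regressor and a single extended tap vector:

$$u_i \triangleq \begin{bmatrix} u_i^{(1)} & u_i^{(2)} & \ldots & u_i^{(M+1)} \end{bmatrix}, \qquad w^o \triangleq \text{col}\{w^{o(1)}, w^{o(2)}, \ldots, w^{o(M+1)}\}$$

so that $d(i) = u_i w^o + v(i)$. Therefore, given measurements $\{d(i), u_i\}$, we can estimate $w^o$ and, subsequently, the individual taps $\{w^o(j), w^o(j,k)\}$, by applying an adaptive filter (of the LMS or RLS type) to the data $\{d(i), u_i\}$ as in Prob. IX.17. In the RLS case, show that the fast array solution of part (a) of Prob. IX.17 still applies in this application.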
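As a rough check on the channel construction, the following Python sketch builds the regressor channels from the most recent input samples and arranges the taps in the same order, so that the stacked inner product can be compared against the double sum directly. The helper names (`volterra_channels`, `stack_taps`, `flatten`) and the dictionary layout for the quadratic taps are assumptions made for this illustration; real-valued data are assumed.

```python
def volterra_channels(u_hist):
    """Build the regressor channels from u_hist[j] = u(i - j), j = 0..M-1.

    The first channel is the linear regressor; the channel for product lag
    `lag` holds u(i-j) * u(i-j-lag) for j = 0..M-1-lag.
    """
    M = len(u_hist)
    channels = [list(u_hist)]
    for lag in range(M):
        channels.append([u_hist[j] * u_hist[j + lag] for j in range(M - lag)])
    return channels

def stack_taps(w1, w2):
    """Arrange the taps in the same channel order as volterra_channels.

    w1[j] multiplies u(i-j); w2[(j, k)] multiplies u(i-j) u(i-k) for j <= k.
    """
    M = len(w1)
    taps = [list(w1)]
    for lag in range(M):
        taps.append([w2[(j, j + lag)] for j in range(M - lag)])
    return taps

def flatten(blocks):
    """Concatenate a list of channels into one extended vector."""
    return [x for block in blocks for x in block]
```

With `u = flatten(volterra_channels(u_hist))` and `w = flatten(stack_taps(w1, w2))`, the noise-free output is the plain inner product of `u` and `w`, which is exactly the single-channel form that LMS or RLS operates on.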


Remark. For more details on Volterra filters, their modeling abilities, and their use in the context of adaptive 
filtering, see Schetzen (1980) and Mathews (1991b). See also Khalaj, Sayed, and Kailath (1993) for additional 
discussion on the results of this problem. 


Problem IX.19 (Laguerre adaptive filters) Consider an adaptive transversal filter structure with multiple poles $\{|a_k| < 1\}$, as shown in Fig. IX.2. Let $b_k = \sqrt{1 - |a_k|^2}$ and introduce the regression vector $u_i = [u(i,0)\ \ u(i,1)\ \ \ldots\ \ u(i,M-1)]$, where the notation $u(i,j)$ refers to the $j$-th entry of $u_i$. Show that two successive regressors satisfy the structural relation $[u(i,0)\ \ u_{i-1}] = [u_i\ \ u(i-1,M-1)]V$, where $V$ is the $(M+1) \times (M+1)$ matrix, e.g., for $M = 5$,


1 a6 0 0 0 0 
0 614 ai 0 0 0 
v= 0 —a1626o 0501 a3 0 0 
0 018203606 — 20301 03505 a3 0 
0  -a14a20360/64 020304 /04 — 0302/04 05/04 0 
0 0102030400 /04 —02030401 /04 030402 /04 —a403/04 1 


Remark. We thus find that the successive regressors of the filter structure of Fig. IX.2 satisfy a certain structural relation. Comparing this relation with the result (37.5) in the shift-structured case, we see that it reduces to (37.5) when the $\{a_k\}$ are all zero since then $V = I$. As in the case of tapped-delay-line implementations, the result of this problem can be exploited to great effect to derive fast fixed-order (and also order-recursive) least-squares algorithms for Laguerre filters as well; see Chapter 16 of Sayed (2003) and also Merched and Sayed (2000b, 2001a, 2001b) for details.


COMPUTER PROJECT 


Project IX.1 (Stability issues in fast least-squares) The purpose of this project is to illustrate some of the stability problems that arise when dealing with fast fixed-order least-squares algorithms. We do so by considering the same adaptive channel estimation application shown in Fig. VIII.1.


FIGURE IX.2 An adaptive transversal structure with multiple poles.


(a) Load the file channel.mat. It contains the 64 samples of a randomly generated channel impulse response sequence. The norm of this impulse response has been normalized to unity. Feed unit-variance Gaussian input data through the channel and add Gaussian noise to its output. Set the noise power at 30 dB below the input signal power. Train an adaptive filter using:


1. The fast transversal filter of Alg. 39.1;

2. The same FTF algorithm but incorporating the rescue mechanism (39.10);

3. The stabilized FTF implementation of Alg. 39.4;

4. The fast array filter of Alg. 37.1.


For each algorithm, generate an ensemble-average learning curve by averaging over 100 experiments. Compare the performance of the algorithms.


(b) All computations in MATLAB are performed in double precision. In order to illustrate some of the instability problems that arise in finite precision, we repeat the simulation of part (a) in a quantized environment. To do so, you may rely on the routine quantize.m from Computer Project VIII.1. Usually, in practice, the internal variables of a filter implementation are suitably scaled in order to reduce the occurrences of overflow and underflow when operating in finite precision. In order to simplify this project, and thereby avoid the need for incorporating scaling, we shall choose the number of bits to be relatively high. Thus fix the word-length for both data and coefficients at B = 36 bits (including the sign bit). Although high, this number of bits will still be useful to illustrate some quantization effects.


Repeat the simulation of part (a) and generate ensemble-average learning curves for each of the algorithms. Compare their performance.


CHAPTER 40


Three Basic Estimation Problems


The recursive least-squares algorithms described so far in Parts VIII (Least-Squares Methods) and IX (Fast RLS Algorithms), including array variants and fast least-squares variants, are usually qualified as fixed-order algorithms. The qualification “fixed-order” means that, from one iteration to another, these implementations propagate quantities that relate to estimation problems of fixed order.

In this part, we shall study RLS algorithms that are order-recursive in nature, as opposed to fixed-order. They are widely known as lattice filters and have several desirable properties such as improved numerical behavior, stability, and modularity, in addition to computational efficiency. In these implementations, least-squares problems of increasing orders are solved successively so that, in addition to time-updates, the lattice filters rely heavily on order-updates for various quantities.

Our treatment of least-squares lattice filters has at least three features: 


1. First, all relevant order-recursive relations are derived without assuming any structure in the regression vectors.


2. Second, and because of the above, the derivation is able to show that it is possible to design efficient lattice filters even for cases where the regressors do not possess shift structure. This generalization is achieved by pinpointing the variable whose update is affected by data structure, and by showing what kind of structure enables an efficient order-recursive relation for the variable.


3. Third, all order-recursive relations are derived by solving regularized least-squares problems, as opposed to standard least-squares problems without regularization.


We start our presentation by motivating the need for lattice filters and by introducing the 
notation that is necessary to describe such filters. The reader will soon realize that, while 
the derivation of this class of filters relies on familiar concepts, it is the excessive use of 
subscripts and superscripts that may confuse the uninitiated reader. The unfortunate truth 
is that the notation is necessary, and the reader needs to become familiar with it. The good 
news is that the notation is suggestive and very much tied to the context. Moreover, the 
underlying concepts and arguments are fairly familiar by now. 


Notation for Order-Recursive Problems 

To study order-recursive problems, it is necessary to adjust the notation in order to be able to indicate both the size of a variable and the time instant at which it becomes available. For example, when referring to a weight vector $w_i$ at time $i$, we shall write $w_{M,i}$ with two subscripts, $M$ and $i$. The first subscript, $M$, is used to indicate that the weight vector is of size $M$ or, equivalently, that it is computed as the solution to a least-squares problem of order $M$, as in (40.1) and (40.3) below. The second subscript, $i$, is used to indicate that the weight vector is dependent on data up to time $i$ and, therefore, becomes available at time $i$. In a similar vein, we shall write $H_{M,i}$ instead of $H_i$ to refer to a data matrix with column dimension $M$ and with data up to time $i$. Similarly, we shall write $\Pi_M$ instead of $\Pi$ to refer to an $M \times M$ regularization matrix. With this notation, we can now provide a brief review of the regularized least-squares problem.


40.1 MOTIVATION FOR LATTICE FILTERS 


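So consider a collection of $(i + 1)$ data $\{d(j), u_{M,j}\}_{j=0}^{i}$ and introduce the observation vector $y_i$ and the data matrix $H_{M,i}$ defined by

$$y_i = \begin{bmatrix} d(0) \\ d(1) \\ \vdots \\ d(i) \end{bmatrix}, \qquad H_{M,i} = \begin{bmatrix} u_{M,0} \\ u_{M,1} \\ \vdots \\ u_{M,i} \end{bmatrix}$$

The exponentially-weighted least-squares problem of order $M$ seeks the $M \times 1$ column vector $w_M$ that solves (cf. Sec. 30.6):

$$\min_{w_M} \left[ \lambda^{i+1} w_M^* \Pi_M w_M + (y_i - H_{M,i}w_M)^* \Lambda_i (y_i - H_{M,i}w_M) \right] \tag{40.1}$$

where $\Pi_M$ is an $M \times M$ positive-definite regularization matrix. In the sequel, we shall choose $\Pi_M$ in a manner similar to the fast array method of Chapter 37 (cf. (37.13)), namely,

$$\Pi_M = \eta^{-1}\text{diag}\{\lambda^{-2}, \lambda^{-3}, \ldots, \lambda^{-(M+1)}\} \tag{40.2}$$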


Moreover,

$$\Lambda_i = \text{diag}\{\lambda^i, \lambda^{i-1}, \ldots, \lambda, 1\}$$

is a diagonal weighting matrix, defined in terms of a forgetting factor $\lambda$ that satisfies $0 \ll \lambda \le 1$. It is sometimes convenient to rewrite (40.1) more explicitly in terms of the individual data $\{d(j), u_{M,j}\}$ as follows:

$$\min_{w_M} \left[ \lambda^{i+1} w_M^* \Pi_M w_M + \sum_{j=0}^{i} \lambda^{i-j} |d(j) - u_{M,j}w_M|^2 \right] \tag{40.3}$$

We denote the solution of (40.3) by $w_{M,i}$ and we already know that it is given by (cf. Thm. 29.5):

$$w_{M,i} = P_{M,i} H_{M,i}^* \Lambda_i y_i \tag{40.4}$$

where

$$P_{M,i} = \left( \lambda^{i+1}\Pi_M + H_{M,i}^* \Lambda_i H_{M,i} \right)^{-1} \tag{40.5}$$

The regularization term $\lambda^{i+1}\Pi_M$ guarantees an invertible coefficient matrix, i.e., an invertible $P_{M,i}^{-1}$. In the absence of regularization (i.e., when $\Pi_M = 0$), we would need to assume that $H_{M,i}$ has full column rank so that $H_{M,i}^* \Lambda_i H_{M,i}$ is invertible. Observe that the regularization matrix in (40.5) has the form

$$\lambda^{i+1}\Pi_M = \eta^{-1}\text{diag}\{\lambda^{i-1}, \lambda^{i-2}, \ldots, \lambda^{i-M}\}$$

We further let $\hat y_{M,i}$ denote the estimate of $y_i$,

$$\hat y_{M,i} = H_{M,i} w_{M,i} \tag{40.6}$$

and we refer to $\hat y_{M,i}$ as the regularized projection (or simply projection) of $y_i$ onto the range space of $H_{M,i}$, written as $\mathcal{R}(H_{M,i})$. Recall from Sec. 29.5 that, when $H_{M,i}$ has full column rank, the projection matrix onto $\mathcal{R}(H_{M,i})$ is $\mathcal{P}_{M,i} = H_{M,i}(H_{M,i}^* H_{M,i})^{-1} H_{M,i}^*$. For the regularized problem (40.1), we have $\hat y_{M,i} = H_{M,i} P_{M,i} H_{M,i}^* \Lambda_i y_i$. Although the matrix $H_{M,i} P_{M,i} H_{M,i}^* \Lambda_i$ is not an actual projection matrix, we shall still refer to $\hat y_{M,i}$ as the (regularized) projection of $y_i$ onto $\mathcal{R}(H_{M,i})$ for ease of reference.

We also define two error vectors, the a posteriori and a priori error vectors:

$$r_{M,i} = y_i - H_{M,i}w_{M,i}, \qquad e_{M,i} = y_i - H_{M,i}w_{M,i-1} \tag{40.7}$$

where $w_{M,i-1}$ is the solution to a least-squares problem similar to (40.1) and (40.3) with data up to time $i - 1$ and with $\lambda^{i+1}$ replaced by $\lambda^{i}$, i.e.,

$$\min_{w_M} \left[ \lambda^{i} w_M^* \Pi_M w_M + \sum_{j=0}^{i-1} \lambda^{i-1-j} |d(j) - u_{M,j}w_M|^2 \right] \implies w_{M,i-1}$$

The last entries of the error vectors $\{r_{M,i}, e_{M,i}\}$ at time $i$ are denoted by$^{20}$

$$\begin{aligned}
r_M(i) &= d(i) - u_{M,i}w_{M,i} && \text{(a posteriori error)} \\
e_M(i) &= d(i) - u_{M,i}w_{M,i-1} && \text{(a priori error)}
\end{aligned} \tag{40.8}$$

and they are related by the conversion factor, $\gamma_M(i)$,

$$r_M(i) = \gamma_M(i) e_M(i) \tag{40.9}$$

which is defined by

$$\gamma_M(i) = 1 - u_{M,i} P_{M,i} u_{M,i}^* = \frac{1}{1 + \lambda^{-1} u_{M,i} P_{M,i-1} u_{M,i}^*} \tag{40.10}$$

Moreover, the minimum cost of the least-squares problem (40.1) is given by (cf. Thm. 29.5):

$$\xi_M(i) = y_i^* \Lambda_i r_{M,i} = y_i^* \Lambda_i [y_i - H_{M,i}w_{M,i}] \tag{40.11}$$


We also know from Alg. 30.2 that RLS allows us to update $w_{M,i}$ and $\xi_M(i)$ recursively as follows:

$$\begin{aligned}
\gamma_M^{-1}(i) &= 1 + \lambda^{-1} u_{M,i} P_{M,i-1} u_{M,i}^* \\
g_{M,i} &= \lambda^{-1} \gamma_M(i) P_{M,i-1} u_{M,i}^* \\
e_M(i) &= d(i) - u_{M,i} w_{M,i-1} \\
w_{M,i} &= w_{M,i-1} + g_{M,i} e_M(i) \\
P_{M,i} &= \lambda^{-1} P_{M,i-1} - g_{M,i} g_{M,i}^* / \gamma_M(i) \\
r_M(i) &= d(i) - u_{M,i} w_{M,i} \\
\xi_M(i) &= \lambda \xi_M(i-1) + r_M(i) e_M^*(i)
\end{aligned} \tag{40.12}$$

with initial conditions $w_{M,-1} = 0$, $\xi_M(-1) = 0$, and $P_{M,-1} = \Pi_M^{-1}$. It also holds that $g_{M,i} = P_{M,i} u_{M,i}^*$.

20 Recall that we use subscripts to indicate the time index of a vector quantity and parentheses for a scalar quantity. Thus, we write $e_{M,i}$ and $e_M(i)$. Likewise, we write $r_{M,i}$ and $r_M(i)$.

The RLS algorithm (40.12) allows us to update $w_{M,i-1}$ to $w_{M,i}$, i.e., it only performs a time-update of the weight-vector solution. Here, both $w_{M,i-1}$ and $w_{M,i}$ are $M$-dimensional vectors, with the former computed from data up to time $i - 1$ while the latter is computed from data up to time $i$.
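To make the time-update concrete, here is a minimal Python sketch of the recursion (40.12) for real-valued data, so that conjugation reduces to transposition. The names `rls_step` and `run_rls`, the noise-free test signal, and the parameter values are assumptions made for the illustration. The loop also tracks how far $r_M(i)$ deviates from $\gamma_M(i)e_M(i)$, which by (40.9) should be at round-off level.

```python
import random

def rls_step(w, P, u, d, lam):
    """One iteration of the recursion (40.12) for real-valued data."""
    M = len(w)
    Pu = [sum(P[a][c] * u[c] for c in range(M)) for a in range(M)]   # P_{M,i-1} u^T
    gamma = 1.0 / (1.0 + sum(u[a] * Pu[a] for a in range(M)) / lam)  # conversion factor
    g = [gamma * x / lam for x in Pu]                                # gain vector
    e = d - sum(u[a] * w[a] for a in range(M))                       # a priori error
    w_new = [w[a] + g[a] * e for a in range(M)]                      # weight time-update
    P_new = [[P[a][c] / lam - g[a] * g[c] / gamma for c in range(M)]
             for a in range(M)]
    r = d - sum(u[a] * w_new[a] for a in range(M))                   # a posteriori error
    return w_new, P_new, gamma, e, r

def run_rls(w_true, n_iter=300, lam=0.99, eta=100.0, seed=0):
    """Run RLS on noise-free data generated by w_true (hypothetical test setup)."""
    rng = random.Random(seed)
    M = len(w_true)
    # Initial condition P_{M,-1} = Pi_M^{-1} = eta * diag{lam^2, ..., lam^(M+1)}
    P = [[eta * lam ** (a + 2) if a == c else 0.0 for c in range(M)]
         for a in range(M)]
    w = [0.0] * M
    xi = 0.0
    worst_gap = 0.0
    for _ in range(n_iter):
        u = [rng.gauss(0.0, 1.0) for _ in range(M)]   # regressor without shift structure
        d = sum(a * c for a, c in zip(u, w_true))
        w, P, gamma, e, r = rls_step(w, P, u, d, lam)
        xi = lam * xi + r * e                         # minimum-cost update
        worst_gap = max(worst_gap, abs(r - gamma * e))
    return w, xi, worst_gap
```

Each call to `rls_step` forms the product of the $M \times M$ matrix $P_{M,i-1}$ with the regressor and then updates all $M^2$ entries of the matrix, which is the source of the $O(M^2)$ cost per iteration of this fixed-order recursion.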

Now, similar to (40.1), consider a least-squares problem of order $M + 1$, i.e.,

$$\min_{w_{M+1}} \left[ \lambda^{i+1} w_{M+1}^* \Pi_{M+1} w_{M+1} + \sum_{j=0}^{i} \lambda^{i-j} |d(j) - u_{M+1,j}w_{M+1}|^2 \right]$$

Its solution is an $(M + 1) \times 1$ column vector that we denote by $w_{M+1,i}$. Although an order-update relation that takes $w_{M,i}$ to $w_{M+1,i}$ is possible (recall Lemmas 32.1 and 32.2; see also Prob. X.9), the lattice filters of this chapter are concerned with other kinds of order-update relations.

Specifically, lattice filters are not concerned with the weight vectors themselves, but rather with the corresponding projections $\{\hat y_{M,i}, \hat y_{M+1,i}\}$. So let $\hat d_M(i)$ denote the estimate of $d(i)$ of order $M$; it is the last entry of $\hat y_{M,i}$, i.e., $\hat d_M(i) = u_{M,i}w_{M,i}$. Likewise, let $\hat d_{M+1}(i)$ denote the estimate of $d(i)$ of order $M + 1$, which is the last entry of $\hat y_{M+1,i}$,

$$\hat d_{M+1}(i) = u_{M+1,i} w_{M+1,i} \tag{40.13}$$

The corresponding a posteriori estimation errors are $r_M(i) = d(i) - \hat d_M(i)$ and $r_{M+1}(i) = d(i) - \hat d_{M+1}(i)$, respectively. It would seem that in order to update $\hat d_M(i)$ to $\hat d_{M+1}(i)$, we may need to order-update $w_{M,i}$ to $w_{M+1,i}$. However, this is not the case. The lattice solutions that we study in this chapter will allow us to update $r_M(i)$ to $r_{M+1}(i)$ directly without the need to evaluate the weight vectors $w_{M,i}$ and $w_{M+1,i}$, or even update them. In so doing, the lattice filters will end up being an efficient alternative to RLS; efficient in the sense that their computational cost will be an order of magnitude smaller than that of RLS, namely, $O(M)$ versus $O(M^2)$ operations per iteration.


40.2 JOINT PROCESS ESTIMATION 


We start our derivation of lattice filters by examining the problem of order-updating the projection vector $\hat y_{M,i}$, i.e., of relating $\hat y_{M+1,i}$ to $\hat y_{M,i}$. This problem is known as joint process estimation. In order to simplify the presentation, and without loss of generality, we illustrate the arguments and constructions for the case $M = 3$. Later, we show how the results extend to generic $M$.

Thus, assume $M = 3$ and consider the data matrix

$$H_{3,i} = \begin{bmatrix} u(0,0) & u(0,1) & u(0,2) \\ u(1,0) & u(1,1) & u(1,2) \\ u(2,0) & u(2,1) & u(2,2) \\ \vdots & \vdots & \vdots \\ u(i,0) & u(i,1) & u(i,2) \end{bmatrix} = \begin{bmatrix} u_{3,0} \\ u_{3,1} \\ u_{3,2} \\ \vdots \\ u_{3,i} \end{bmatrix} \tag{40.14}$$

The subscript 3 refers to the order of the estimation problem (i.e., to the column dimension of the data matrix), and the subscript $i$ indicates that the data matrix contains data up to time $i$. Observe that we are denoting the individual entries of $H_{3,i}$ and, correspondingly, of the regressors $\{u_{3,i}\}$, by $\{u(i,j)\}$, with the first index referring to time and the second index referring to the column position within the regression vector, namely,

$$u_{3,i} = \begin{bmatrix} u(i,0) & u(i,1) & u(i,2) \end{bmatrix}$$

In other words, we are not assuming shift structure in $u_{3,i}$, i.e., the entries of $u_{3,i}$ are not assumed to be delayed versions of some input sequence. If this were the case, then $H_{3,i}$ would have been of the form

$$H_{3,i} = \begin{bmatrix} u(0) & & \\ u(1) & u(0) & \\ u(2) & u(1) & u(0) \\ \vdots & \vdots & \vdots \\ u(i) & u(i-1) & u(i-2) \end{bmatrix}$$

However, since all results in the sequel, until Sec. 41.1, will hold irrespective of any structure in $u_{3,i}$, we shall proceed with our arguments by treating the general case (40.14).
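The (regularized) projection of $y_i$ onto $\mathcal{R}(H_{3,i})$ is given by (cf. (40.6)):

$$\hat y_{3,i} = H_{3,i} P_{3,i} H_{3,i}^* \Lambda_i y_i = H_{3,i} w_{3,i} \tag{40.15}$$

where

$$P_{3,i} = \left( \lambda^{i+1}\Pi_3 + H_{3,i}^* \Lambda_i H_{3,i} \right)^{-1}, \qquad w_{3,i} = P_{3,i} H_{3,i}^* \Lambda_i y_i \tag{40.16}$$

and

$$\lambda^{i+1}\Pi_3 = \eta^{-1}\text{diag}\{\lambda^{i-1}, \lambda^{i-2}, \lambda^{i-3}\} \tag{40.17}$$

We say that $\hat y_{3,i}$ is the third-order projection of $y_i$ onto $\mathcal{R}(H_{3,i})$. Now suppose that one more column is appended to $H_{3,i}$, which then becomes

$$H_{4,i} = \left[ \begin{array}{ccc|c} u(0,0) & u(0,1) & u(0,2) & u(0,3) \\ u(1,0) & u(1,1) & u(1,2) & u(1,3) \\ u(2,0) & u(2,1) & u(2,2) & u(2,3) \\ \vdots & \vdots & \vdots & \vdots \\ u(i,0) & u(i,1) & u(i,2) & u(i,3) \end{array} \right] = \begin{bmatrix} H_{3,i} & x_{3,i} \end{bmatrix} \tag{40.18}$$

where we are denoting the last column of $H_{4,i}$ by $x_{3,i}$. The (regularized) projection of the same vector $y_i$ onto the extended range space $\mathcal{R}(H_{4,i})$ is now given by

$$\hat y_{4,i} = H_{4,i} P_{4,i} H_{4,i}^* \Lambda_i y_i = H_{4,i} w_{4,i} \tag{40.19}$$

where

$$P_{4,i} = \left( \lambda^{i+1}\Pi_4 + H_{4,i}^* \Lambda_i H_{4,i} \right)^{-1}, \qquad w_{4,i} = P_{4,i} H_{4,i}^* \Lambda_i y_i \tag{40.20}$$

and

$$\lambda^{i+1}\Pi_4 = \eta^{-1}\text{diag}\{\lambda^{i-1}, \lambda^{i-2}, \lambda^{i-3}, \lambda^{i-4}\} \tag{40.21}$$

Comparing expressions (40.15) and (40.19) for $\{\hat y_{3,i}, \hat y_{4,i}\}$ we see that they differ by virtue of the difference between the data matrices $\{H_{3,i}, H_{4,i}\}$. However, these data matrices are identical except for the last column in $H_{4,i}$. Therefore, it should be possible to relate the projections $\{\hat y_{3,i}, \hat y_{4,i}\}$ and to obtain an order-update relation for them. This process of order-updating the projection of the observation vector is known as joint process estimation.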

We already studied such order-update problems in Sec. 32.1. Recall that in that section we derived, both algebraically and geometrically, the relations that exist between the (regularized) projection of an observation vector onto a data matrix $H$ and onto its augmented version $[H\ \ h]$, for some column $h$. More specifically, comparing with the statement of Lemma 32.1, we can make the following identifications:

$$\begin{array}{llll}
H \leftarrow H_{3,i} & h \leftarrow x_{3,i} & P \leftarrow P_{3,i} & P_z \leftarrow P_{4,i} \\
\Pi \leftarrow \lambda^{i+1}\Pi_3 & \sigma \leftarrow \eta^{-1}\lambda^{i-4} & \gamma \leftarrow \gamma_3(i) & \gamma_z \leftarrow \gamma_4(i) \\
\hat w \leftarrow w_{3,i} & \hat w_z \leftarrow w_{4,i} & \tilde y \leftarrow r_{3,i} & \tilde y_z \leftarrow r_{4,i}
\end{array}$$

Therefore, using the result of Lemma 32.1, we can relate the variables of the projection problems that result in $\{\hat y_{3,i}, \hat y_{4,i}\}$ as follows. Let $w_{3,i}^b$ denote the solution of the least-squares problem:

$$\min_{w_3^b} \left[ \lambda^{i+1} w_3^{b*} \Pi_3 w_3^b + (x_{3,i} - H_{3,i}w_3^b)^* \Lambda_i (x_{3,i} - H_{3,i}w_3^b) \right] \tag{40.22}$$

That is, $w_{3,i}^b$ is the vector that projects $x_{3,i}$ onto $\mathcal{R}(H_{3,i})$,

$$w_{3,i}^b = P_{3,i} H_{3,i}^* \Lambda_i x_{3,i}$$

The subscript 3 refers to an estimation problem of order 3, while the subscript $i$ denotes the use of data up to time $i$. The superscript $b$ refers to backward projection. The reason for this terminology is that problem (40.22) amounts to estimating the last column of $H_{4,i}$ from its leading columns, $H_{3,i}$. Let $\xi_3^b(i)$ denote the minimum cost of (40.22), i.e.,

$$\xi_3^b(i) = x_{3,i}^* \Lambda_i b_{3,i}$$

where $b_{3,i}$ is the (backward) a posteriori error vector that results from projecting $x_{3,i}$ onto $\mathcal{R}(H_{3,i})$,

$$b_{3,i} = x_{3,i} - H_{3,i} w_{3,i}^b$$

We denote the last entry of $b_{3,i}$ by $b_3(i)$; it refers to the estimation error in estimating the last entry of $x_{3,i}$ from the last row of $H_{3,i}$ (namely, $u_{3,i}$). Define further the scalar coefficient

$$\kappa_3(i) \triangleq \frac{b_{3,i}^* \Lambda_i y_i}{\eta^{-1}\lambda^{i-4} + \xi_3^b(i)} = \frac{\rho_3^*(i)}{\eta^{-1}\lambda^{i-4} + \xi_3^b(i)} \tag{40.23}$$

where

$$\rho_3(i) \triangleq y_i^* \Lambda_i b_{3,i} \tag{40.24}$$

Then from Lemma 32.1 we conclude that the following order-update relations hold:

$$P_{4,i} = \begin{bmatrix} P_{3,i} & 0 \\ 0 & 0 \end{bmatrix} + \frac{1}{\eta^{-1}\lambda^{i-4} + \xi_3^b(i)} \begin{bmatrix} -w_{3,i}^b \\ 1 \end{bmatrix} \begin{bmatrix} -w_{3,i}^{b*} & 1 \end{bmatrix} \tag{40.25}$$

$$\hat y_{4,i} = \hat y_{3,i} + \kappa_3(i) b_{3,i} \tag{40.26}$$

$$r_{4,i} = r_{3,i} - \kappa_3(i) b_{3,i} \tag{40.27}$$

$$\xi_4(i) = \xi_3(i) - \frac{|\rho_3(i)|^2}{\eta^{-1}\lambda^{i-4} + \xi_3^b(i)} \tag{40.28}$$

$$\gamma_4(i) = \gamma_3(i) - \frac{|b_3(i)|^2}{\eta^{-1}\lambda^{i-4} + \xi_3^b(i)} \tag{40.29}$$

$$w_{4,i} = \begin{bmatrix} w_{3,i} \\ 0 \end{bmatrix} + \kappa_3(i) \begin{bmatrix} -w_{3,i}^b \\ 1 \end{bmatrix} \tag{40.30}$$

We have therefore arrived at an order-update relation (40.27) for the a posteriori error vectors $\{r_{3,i}, r_{4,i}\}$. It tells us that in order to update $r_{3,i}$ to $r_{4,i}$ we need to know $b_{3,i}$. In the same vein, in order to move forward and update $r_{4,i}$ to $r_{5,i}$ we need $b_{4,i}$, and so on. This means that it is necessary to know how to order-update the backward error vectors as well, which motivates us to examine more closely the backward estimation problem.
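The order-update (40.27) and the weight relation (40.30) can be checked numerically on random data. The Python sketch below is only an illustration under stated assumptions: the data are real-valued, the helper names (`solve`, `reg_ls`, `wdot`, `check_joint_process_update`) and parameter values are hypothetical, and the regularized problems are solved directly through their normal equations rather than by any recursion from the text.

```python
import random

def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    aug = [row[:] + [b[k]] for k, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(c + 1, n):
            m = aug[r][c] / aug[c][c]
            for k in range(c, n + 1):
                aug[r][k] -= m * aug[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(aug[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x

def reg_ls(H, y, wts, reg):
    """Minimize sum_k reg[k]*w[k]^2 + sum_j wts[j]*(y[j] - H[j].w)^2 (real data).

    Returns the minimizer w and the residual vector y - H w."""
    rows, n = len(H), len(H[0])
    A = [[sum(wts[j] * H[j][a] * H[j][c] for j in range(rows))
          + (reg[a] if a == c else 0.0) for c in range(n)] for a in range(n)]
    rhs = [sum(wts[j] * H[j][a] * y[j] for j in range(rows)) for a in range(n)]
    w = solve(A, rhs)
    res = [y[j] - sum(H[j][a] * w[a] for a in range(n)) for j in range(rows)]
    return w, res

def wdot(a, b, wts):
    """Weighted inner product a^T Lambda b."""
    return sum(wj * aj * bj for wj, aj, bj in zip(wts, a, b))

def check_joint_process_update(i=9, lam=0.95, eta=50.0, seed=1):
    rng = random.Random(seed)
    H4 = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(i + 1)]
    y = [rng.gauss(0, 1) for _ in range(i + 1)]
    H3 = [row[:3] for row in H4]                          # leading three columns
    x3 = [row[3] for row in H4]                           # appended last column
    wts = [lam ** (i - j) for j in range(i + 1)]          # diagonal of Lambda_i
    reg4 = [lam ** (i - 1 - k) / eta for k in range(4)]   # diagonal of lam^(i+1) Pi_4
    w3, r3 = reg_ls(H3, y, wts, reg4[:3])                 # order-3 projection of y
    w4, r4 = reg_ls(H4, y, wts, reg4)                     # order-4 projection of y
    wb3, b3 = reg_ls(H3, x3, wts, reg4[:3])               # backward problem (40.22)
    kappa = wdot(y, b3, wts) / (reg4[3] + wdot(x3, b3, wts))   # (40.23), real case
    r4_pred = [a - kappa * c for a, c in zip(r3, b3)]          # (40.27)
    w4_pred = [a - kappa * c for a, c in zip(w3, wb3)] + [kappa]   # (40.30)
    err_r = max(abs(a - c) for a, c in zip(r4, r4_pred))
    err_w = max(abs(a - c) for a, c in zip(w4, w4_pred))
    return err_r, err_w
```

Both returned discrepancies should sit at round-off level, since the order-4 residual and weight vector are reproduced from order-3 quantities and a single scalar coefficient.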


40.3 BACKWARD ESTIMATION PROBLEM 


For this purpose, we return to the data matrix $H_{3,i}$ in (40.18) and partition it as

$$H_{3,i} = \begin{bmatrix} x_{0,i} & \bar H_{2,i} \end{bmatrix}$$

with $x_{0,i}$ denoting its leading column and $\bar H_{2,i}$ denoting the remaining columns. In this way, the extended data matrix $H_{4,i}$ of (40.18) can be partitioned as

$$H_{4,i} = \begin{bmatrix} H_{3,i} & x_{3,i} \end{bmatrix} = \begin{bmatrix} x_{0,i} & \bar H_{2,i} & x_{3,i} \end{bmatrix} \tag{40.31}$$

with $\{x_{0,i}, x_{3,i}\}$ denoting its leading and trailing columns, and $\bar H_{2,i}$ denoting the center columns.

We can then consider two backward estimation problems: one has order 3 and estimates $x_{3,i}$ from $H_{3,i}$, and the other has order 2 and estimates $x_{3,i}$ from $\bar H_{2,i}$. The first problem is the one we considered above in (40.22) with regularization matrix $\lambda^{i+1}\Pi_3$ and it leads to the backward residual vector $b_{3,i}$,

$$b_{3,i} = x_{3,i} - H_{3,i} w_{3,i}^b \tag{40.32}$$

with the corresponding coefficient matrix

$$P_{3,i} = \left( \lambda^{i+1}\Pi_3 + H_{3,i}^* \Lambda_i H_{3,i} \right)^{-1} \tag{40.33}$$

The second problem corresponds to solving the following least-squares problem:

$$\min_{w_2^b} \left[ \lambda^{i} w_2^{b*} \Pi_2 w_2^b + (x_{3,i} - \bar H_{2,i}w_2^b)^* \Lambda_i (x_{3,i} - \bar H_{2,i}w_2^b) \right] \tag{40.34}$$

with regularization matrix chosen as

$$\lambda^{i}\Pi_2 = \eta^{-1}\text{diag}\{\lambda^{i-2}, \lambda^{i-3}\} \tag{40.35}$$

The optimal solution of (40.34) is denoted by $\bar w_{2,i}^b$ and is given by

$$\bar w_{2,i}^b = \bar P_{2,i} \bar H_{2,i}^* \Lambda_i x_{3,i} \tag{40.36}$$

with

$$\bar P_{2,i} = \left( \lambda^{i}\Pi_2 + \bar H_{2,i}^* \Lambda_i \bar H_{2,i} \right)^{-1} \tag{40.37}$$

and whose residual vector we denote by

$$\bar b_{2,i} = x_{3,i} - \bar H_{2,i} \bar w_{2,i}^b$$

The resulting minimum cost of (40.34) is denoted by $\xi_2^{\bar b}(i)$. The reason for the notation $\bar b_{2,i}$ (with an overbar) as opposed to $b_{2,i}$ is that in our development, $b_{2,i}$ would correspond to the residual vector that results from projecting the third column of $H_{3,i}$ onto the range space of its leading two columns. More specifically, denote the columns of $H_{4,i}$ generically by $H_{4,i} = [m\ \ n\ \ o\ \ p]$. Then projecting $o$ onto $[m\ \ n]$ results in the residual vector $b_{2,i}$, while projecting $p$ onto $[m\ \ n\ \ o]$ results in the residual vector $b_{3,i}$. Observe that in both cases we start from the initial column $m$. In contrast, projecting $p$ onto $[n\ \ o]$ results in the residual vector $\bar b_{2,i}$. The initial column now is $n$ and, hence, the use of the bar notation to distinguish between both second-order projections: $o$ onto $[m\ \ n]$ and $p$ onto $[n\ \ o]$; see Fig. 40.1. We shall study more closely later the relation between $\{b_{2,i}, \bar b_{2,i}\}$, e.g., in Sec. 41.1 where we show that, when the regressors have shift structure, it will hold that $\bar b_{2,i}$ is related to $b_{2,i-1}$. For now, it suffices to proceed with $\bar b_{2,i}$.

FIGURE 40.1 Two second-order backward projection problems with the corresponding residual vectors.

The argument that follows for relating $\bar b_{2,i}$ and $b_{3,i}$ is similar to the argument we employed in the previous section for relating $r_{3,i}$ and $r_{4,i}$. Thus, note that we are faced with the problem of projecting the same vector $x_{3,i}$ onto the range spaces of two data matrices; one is $\bar H_{2,i}$ and the other is $H_{3,i}$, which is obtained from $\bar H_{2,i}$ by augmenting it by a column to the left.

We already studied such order-update problems in Sec. 32.2 in some detail. Recall that in that section we derived, both algebraically and geometrically, the relations that exist between the (regularized) projection of an observation vector onto a data matrix $H$ and its augmented version $[h\ \ H]$, for some column $h$. More specifically, comparing with the statement of Lemma 32.2, we can make the following identifications:

$$\begin{array}{llll}
H \leftarrow \bar H_{2,i} & h \leftarrow x_{0,i} & P \leftarrow \bar P_{2,i} & P_z \leftarrow P_{3,i} \\
\Pi \leftarrow \lambda^{i}\Pi_2 & \sigma \leftarrow \eta^{-1}\lambda^{i-1} & \gamma \leftarrow \bar\gamma_2(i) & \gamma_z \leftarrow \gamma_3(i) \\
\hat w \leftarrow \bar w_{2,i}^b & \hat w_z \leftarrow w_{3,i}^b & \tilde y \leftarrow \bar b_{2,i} & \tilde y_z \leftarrow b_{3,i}
\end{array}$$

Therefore, using the result of Lemma 32.2, we can relate the variables of the projection problems that result in $\{\bar b_{2,i}, b_{3,i}\}$ as follows.
Let $w_{2,i}^f$ denote the solution to the least-squares problem:

$$\min_{w_2^f} \left[ \lambda^{i} w_2^{f*} \Pi_2 w_2^f + (x_{0,i} - \bar H_{2,i}w_2^f)^* \Lambda_i (x_{0,i} - \bar H_{2,i}w_2^f) \right] \tag{40.38}$$

which projects the leading column $x_{0,i}$ onto $\mathcal{R}(\bar H_{2,i})$, namely,

$$w_{2,i}^f = \bar P_{2,i} \bar H_{2,i}^* \Lambda_i x_{0,i}$$

The subscript 2 in $w_{2,i}^f$ refers to an estimation problem of order 2, while the subscript $i$ denotes the use of data up to time $i$. The superscript $f$ refers to forward projection. The reason for this terminology is that the above problem amounts to estimating the leading column of $H_{3,i}$ from its trailing columns, $\bar H_{2,i}$.

Let $\xi_2^f(i)$ denote the minimum cost of (40.38), i.e.,

$$\xi_2^f(i) = x_{0,i}^* \Lambda_i f_{2,i}$$

where $f_{2,i}$ is the (forward) a posteriori error vector that results from projecting $x_{0,i}$ onto $\mathcal{R}(\bar H_{2,i})$. We denote the last entry of $f_{2,i}$ by $f_2(i)$; it refers to the estimation error in estimating the last entry of $x_{0,i}$ from the last row of $\bar H_{2,i}$. Define further the scalar coefficient

$$\kappa_2^b(i) \triangleq \frac{f_{2,i}^* \Lambda_i x_{3,i}}{\eta^{-1}\lambda^{i-1} + \xi_2^f(i)} = \frac{\delta_2(i)}{\eta^{-1}\lambda^{i-1} + \xi_2^f(i)} \tag{40.39}$$

where

$$\delta_2(i) \triangleq f_{2,i}^* \Lambda_i x_{3,i} \tag{40.40}$$

Then from Lemma 32.2 we conclude that the following relations hold:

$$P_{3,i} = \begin{bmatrix} 0 & 0 \\ 0 & \bar P_{2,i} \end{bmatrix} + \frac{1}{\eta^{-1}\lambda^{i-1} + \xi_2^f(i)} \begin{bmatrix} 1 \\ -w_{2,i}^f \end{bmatrix} \begin{bmatrix} 1 & -w_{2,i}^{f*} \end{bmatrix} \tag{40.41}$$

$$b_{3,i} = \bar b_{2,i} - \kappa_2^b(i) f_{2,i} \tag{40.42}$$

$$\xi_3^b(i) = \xi_2^{\bar b}(i) - \frac{|\delta_2(i)|^2}{\eta^{-1}\lambda^{i-1} + \xi_2^f(i)} \tag{40.43}$$

$$\gamma_3(i) = \bar\gamma_2(i) - \frac{|f_2(i)|^2}{\eta^{-1}\lambda^{i-1} + \xi_2^f(i)} \tag{40.44}$$

We have therefore arrived at an order-update relation (40.42) for the a posteriori backward residual vectors $\{b_{3,i}, \bar b_{2,i}\}$. It tells us that in order to update $\bar b_{2,i}$ to $b_{3,i}$ we need to know $f_{2,i}$. In the same vein, in order to move forward and update $\bar b_{3,i}$ to $b_{4,i}$ we need $f_{3,i}$, and so on. This means that it is necessary to know how to order-update the forward error vectors as well, which motivates us to examine more closely the forward estimation problem.


40.4 FORWARD ESTIMATION PROBLEM 


To do so, we reconsider the data matrix $H_{4,i}$ in (40.18) and now partition it as

$$H_{4,i} = \begin{bmatrix} x_{0,i} & \bar H_{3,i} \end{bmatrix} = \begin{bmatrix} x_{0,i} & \bar H_{2,i} & x_{3,i} \end{bmatrix} \tag{40.45}$$

where $\bar H_{3,i}$ denotes its trailing columns. We then consider two forward estimation problems: one has order 2 and estimates $x_{0,i}$ from $\bar H_{2,i}$, and the other has order 3 and estimates $x_{0,i}$ from $\bar H_{3,i}$. The first problem is the one we considered above in (40.38) with regularization matrix $\lambda^{i}\Pi_2$ and leads to the forward residual vector $f_{2,i}$,

$$f_{2,i} = x_{0,i} - \bar H_{2,i} w_{2,i}^f \tag{40.46}$$

with

$$\bar P_{2,i} = \left( \lambda^{i}\Pi_2 + \bar H_{2,i}^* \Lambda_i \bar H_{2,i} \right)^{-1} \tag{40.47}$$

The second problem corresponds to solving the following least-squares problem:

$$\min_{w_3^f} \left[ \lambda^{i} w_3^{f*} \Pi_3 w_3^f + (x_{0,i} - \bar H_{3,i}w_3^f)^* \Lambda_i (x_{0,i} - \bar H_{3,i}w_3^f) \right] \tag{40.48}$$

with regularization matrix

$$\lambda^{i}\Pi_3 = \eta^{-1}\text{diag}\{\lambda^{i-2}, \lambda^{i-3}, \lambda^{i-4}\} \tag{40.49}$$

The optimal solution of (40.48) is denoted by $w_{3,i}^f$,

$$w_{3,i}^f = \bar P_{3,i} \bar H_{3,i}^* \Lambda_i x_{0,i} \tag{40.50}$$

with coefficient matrix

$$\bar P_{3,i} = \left( \lambda^{i}\Pi_3 + \bar H_{3,i}^* \Lambda_i \bar H_{3,i} \right)^{-1} \tag{40.51}$$

The corresponding residual vector is

$$f_{3,i} = x_{0,i} - \bar H_{3,i} w_{3,i}^f$$

and the resulting minimum cost is denoted by $\xi_3^f(i)$. Observe that we are now denoting the residual vectors of problems (40.38) and (40.48) by $f_{2,i}$ and $f_{3,i}$, respectively, without the need for the bar notation. To see why, denote again the columns of $H_{4,i}$ generically by $H_{4,i} = [m\ \ n\ \ o\ \ p]$. Then projecting $m$ onto $[n\ \ o]$ results in the residual vector $f_{2,i}$, while projecting $m$ onto $[n\ \ o\ \ p]$ results in the residual vector $f_{3,i}$. In both cases, we start from the same initial column $n$; see Fig. 40.2.

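FIGURE 40.2 Two forward projection problems with the corresponding residual vectors.

Again, the argument that follows for relating $f_{2,i}$ and $f_{3,i}$ is similar to the arguments we employed in Secs. 40.2 and 40.3 for relating $r_{3,i}$ and $r_{4,i}$, as well as $\bar b_{2,i}$ and $b_{3,i}$. Thus, note that we are faced with the problem of projecting the same column vector, $x_{0,i}$, onto the range spaces of two data matrices: one is $\bar H_{2,i}$ and the other is $\bar H_{3,i}$, which is obtained from $\bar H_{2,i}$ by augmenting it by a column to the right.

We studied such order-update problems in Sec. 32.1. Recall that in that section we derived, both algebraically and geometrically, the relations that exist between the (regularized) projection of an observation vector onto a data matrix $H$ and onto its augmented version $[H\ \ h]$, for some column $h$. More specifically, comparing with the statement of Lemma 32.1, we can make the following identifications:

$$\begin{array}{llll}
H \leftarrow \bar H_{2,i} & h \leftarrow x_{3,i} & P \leftarrow \bar P_{2,i} & P_z \leftarrow \bar P_{3,i} \\
\Pi \leftarrow \lambda^{i}\Pi_2 & \sigma \leftarrow \eta^{-1}\lambda^{i-4} & \gamma \leftarrow \bar\gamma_2(i) & \gamma_z \leftarrow \bar\gamma_3(i) \\
\hat w \leftarrow w_{2,i}^f & \hat w_z \leftarrow w_{3,i}^f & \tilde y \leftarrow f_{2,i} & \tilde y_z \leftarrow f_{3,i}
\end{array}$$

Define further the scalar coefficient

$$\kappa_2^f(i) \triangleq \frac{\bar b_{2,i}^* \Lambda_i x_{0,i}}{\eta^{-1}\lambda^{i-4} + \xi_2^{\bar b}(i)} = \frac{\delta_2^*(i)}{\eta^{-1}\lambda^{i-4} + \xi_2^{\bar b}(i)} \tag{40.52}$$

Note that we are using $\delta_2^*(i)$ in the numerator of $\kappa_2^f(i)$, with $\delta_2(i)$ being the coefficient we used in the numerator of $\kappa_2^b(i)$ in (40.39). This is because

$$\begin{aligned}
\bar b_{2,i}^* \Lambda_i x_{0,i} &= [x_{3,i} - \bar H_{2,i}\bar w_{2,i}^b]^* \Lambda_i x_{0,i} = [x_{3,i} - \bar H_{2,i}\bar P_{2,i}\bar H_{2,i}^* \Lambda_i x_{3,i}]^* \Lambda_i x_{0,i} \\
&= x_{3,i}^* \Lambda_i [I - \bar H_{2,i}\bar P_{2,i}\bar H_{2,i}^* \Lambda_i] x_{0,i} \\
&= x_{3,i}^* \Lambda_i [x_{0,i} - \bar H_{2,i} w_{2,i}^f] \\
&= x_{3,i}^* \Lambda_i f_{2,i}
\end{aligned} \tag{40.53}$$

That is, $\delta_2(i)$ is also given by

$$\delta_2(i) = x_{0,i}^* \Lambda_i \bar b_{2,i} \tag{40.54}$$

Therefore, using the result of Lemma 32.1, we can relate the variables of the projection problems that result in $\{f_{2,i}, f_{3,i}\}$ as follows:

$$\bar P_{3,i} = \begin{bmatrix} \bar P_{2,i} & 0 \\ 0 & 0 \end{bmatrix} + \frac{1}{\eta^{-1}\lambda^{i-4} + \xi_2^{\bar b}(i)} \begin{bmatrix} -\bar w_{2,i}^b \\ 1 \end{bmatrix} \begin{bmatrix} -\bar w_{2,i}^{b*} & 1 \end{bmatrix} \tag{40.55}$$

$$f_{3,i} = f_{2,i} - \kappa_2^f(i) \bar b_{2,i} \tag{40.56}$$

$$\xi_3^f(i) = \xi_2^f(i) - \frac{|\delta_2(i)|^2}{\eta^{-1}\lambda^{i-4} + \xi_2^{\bar b}(i)} \tag{40.57}$$

$$\bar\gamma_3(i) = \bar\gamma_2(i) - \frac{|\bar b_2(i)|^2}{\eta^{-1}\lambda^{i-4} + \xi_2^{\bar b}(i)} \tag{40.58}$$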


40.5 TIME AND ORDER-UPDATE RELATIONS 


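We summarize the order-update relations derived so far for the case of a generic order $M$.


Order-Update of Estimation Errors
Consider several equivalent partitionings of the data matrix $H_{M+2,i}$:

$$H_{M+2,i} = \begin{bmatrix} x_{0,i} & x_{1,i} & \ldots & x_{M+1,i} \end{bmatrix} = \begin{bmatrix} H_{M+1,i} & x_{M+1,i} \end{bmatrix} = \begin{bmatrix} x_{0,i} & \bar H_{M+1,i} \end{bmatrix} = \begin{bmatrix} x_{0,i} & \bar H_{M,i} & x_{M+1,i} \end{bmatrix}$$

where $\{x_{j,i}\}$ denote the individual columns of $H_{M+2,i}$. Let also $\{u_{M,i}, \bar u_{M,i}\}$ denote the last rows of $H_{M,i}$ and $\bar H_{M,i}$.

Let further

$$\begin{aligned}
r_{M,i} &= \text{a posteriori residual from projecting } y_i \text{ onto } H_{M,i} \\
b_{M,i} &= \text{a posteriori residual from projecting } x_{M,i} \text{ onto } H_{M,i} \\
f_{M,i} &= \text{a posteriori residual from projecting } x_{0,i} \text{ onto } \bar H_{M,i} \\
\bar b_{M,i} &= \text{a posteriori residual from projecting } x_{M+1,i} \text{ onto } \bar H_{M,i}
\end{aligned}$$

where the projection problems for $\{r_{M,i}, b_{M,i}\}$ employ the regularization matrix $\lambda^{i+1}\Pi_M$, while the projection problems for $\{f_{M,i}, \bar b_{M,i}\}$ employ the regularization matrix $\lambda^{i}\Pi_M$. Hence,

$$\begin{aligned}
r_{M,i} &= y_i - H_{M,i} w_{M,i}, & b_{M,i} &= x_{M,i} - H_{M,i} w_{M,i}^b \\
f_{M,i} &= x_{0,i} - \bar H_{M,i} w_{M,i}^f, & \bar b_{M,i} &= x_{M+1,i} - \bar H_{M,i} \bar w_{M,i}^b
\end{aligned} \tag{40.59}$$

where, for example, $w_{M,i}^f$ is the solution to the regularized least-squares problem:

$$\min_{w_M^f} \left[ \lambda^{i} w_M^{f*} \Pi_M w_M^f + (x_{0,i} - \bar H_{M,i}w_M^f)^* \Lambda_i (x_{0,i} - \bar H_{M,i}w_M^f) \right] \tag{40.60}$$

and similarly for $\{w_{M,i}, w_{M,i}^b, \bar w_{M,i}^b\}$. These constructions are depicted in Fig. 40.3. The last entries of $\{r_{M,i}, b_{M,i}, f_{M,i}, \bar b_{M,i}\}$ are denoted by

$$\{r_M(i),\ b_M(i),\ f_M(i),\ \bar b_M(i)\} \tag{40.61}$$

FIGURE 40.3 Projections of $\{x_{0,i}, x_{M,i}, x_{M+1,i}, y_i\}$ onto the relevant data matrices with the resulting residual vectors $\{f_{M,i}, b_{M,i}, \bar b_{M,i}, r_{M,i}\}$. Panels: (a) Backward projection; (b) Forward projection.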


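The derivations in the earlier sections show that these residual vectors satisfy updates of
the form:


$$r_{M+1,i} = r_{M,i} - \kappa_M(i)\, b_{M,i}, \qquad b_{M+1,i} = \bar b_{M,i} - \kappa^b_M(i)\, f_{M,i}, \qquad f_{M+1,i} = f_{M,i} - \kappa^f_M(i)\, \bar b_{M,i} \tag{40.62}$$

where we still need to derive an update for $\bar b_{M+1,i}$. We postpone this discussion to Sec. 41.1
due to its dependence on data structure. From (40.62) we obtain the following relations for
the a posteriori estimation errors at time $i$:


$$\begin{aligned}
r_{M+1}(i) &= r_M(i) - \kappa_M(i)\, b_M(i) \\
b_{M+1}(i) &= \bar b_M(i) - \kappa^b_M(i)\, f_M(i) \\
f_{M+1}(i) &= f_M(i) - \kappa^f_M(i)\, \bar b_M(i)
\end{aligned} \tag{40.63}$$


where the scaling coefficients $\{\kappa_M(i), \kappa^b_M(i), \kappa^f_M(i)\}$, also called reflection coefficients,
are defined as the ratios


$$\begin{aligned}
\kappa_M(i) &= \rho_M^*(i) / (\eta^{-1}\lambda^{i-M-1} + \xi^b_M(i)) \\
\kappa^b_M(i) &= \delta_M(i) / (\eta^{-1}\lambda^{i-1} + \xi^f_M(i)) \\
\kappa^f_M(i) &= \delta_M^*(i) / (\eta^{-1}\lambda^{i-M-2} + \bar\xi^b_M(i))
\end{aligned} \tag{40.64}$$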


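and the quantities $\{\delta_M(i), \rho_M(i), \xi^b_M(i), \bar\xi^b_M(i), \xi^f_M(i)\}$ are defined in terms of the inner
products


$$\begin{aligned}
\rho_M(i) &= y_i^* \Lambda_i b_{M,i}, & \xi^b_M(i) &= x_{M,i}^* \Lambda_i b_{M,i} \\
\delta_M(i) &= x_{0,i}^* \Lambda_i \bar b_{M,i} = f_{M,i}^* \Lambda_i x_{M+1,i}, & \bar\xi^b_M(i) &= x_{M+1,i}^* \Lambda_i \bar b_{M,i} \\
\xi^f_M(i) &= x_{0,i}^* \Lambda_i f_{M,i}, & \xi_M(i) &= y_i^* \Lambda_i r_{M,i}
\end{aligned} \tag{40.65}$$


The quantities $\{\xi_M(i), \xi^b_M(i), \xi^f_M(i), \bar\xi^b_M(i)\}$ denote the minimum costs of the projection
problems that result in $\{r_{M,i}, b_{M,i}, f_{M,i}, \bar b_{M,i}\}$. Let further $\{\gamma_M(i), \bar\gamma_M(i)\}$ denote the
conversion factors associated with the projection problems $\{r_{M,i}, b_{M,i}\}$ and $\{f_{M,i}, \bar b_{M,i}\}$
(the first two have the same conversion factor $\gamma_M(i)$, while the last two have the same
conversion factor $\bar\gamma_M(i)$). That is,


$$\gamma_M(i) = 1 - u_{M,i} P_{M,i} u_{M,i}^*, \qquad P_{M,i} = [\lambda^{i+1}\Pi_M + H_{M,i}^* \Lambda_i H_{M,i}]^{-1}$$

$$\bar\gamma_M(i) = 1 - \bar u_{M,i} \bar P_{M,i} \bar u_{M,i}^*, \qquad \bar P_{M,i} = [\lambda^{i+1}\bar\Pi_M + \bar H_{M,i}^* \Lambda_i \bar H_{M,i}]^{-1}$$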
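Then the earlier discussions also established the following update relations:

$$\begin{aligned}
\xi_{M+1}(i) &= \xi_M(i) - |\rho_M(i)|^2 / (\eta^{-1}\lambda^{i-M-1} + \xi^b_M(i)) \\
\xi^b_{M+1}(i) &= \bar\xi^b_M(i) - |\delta_M(i)|^2 / (\eta^{-1}\lambda^{i-1} + \xi^f_M(i)) \\
\xi^f_{M+1}(i) &= \xi^f_M(i) - |\delta_M(i)|^2 / (\eta^{-1}\lambda^{i-M-2} + \bar\xi^b_M(i)) \\
\gamma_{M+1}(i) &= \gamma_M(i) - |b_M(i)|^2 / (\eta^{-1}\lambda^{i-M-1} + \xi^b_M(i)) \\
\gamma_{M+1}(i) &= \bar\gamma_M(i) - |f_M(i)|^2 / (\eta^{-1}\lambda^{i-1} + \xi^f_M(i)) \\
\bar\gamma_{M+1}(i) &= \bar\gamma_M(i) - |\bar b_M(i)|^2 / (\eta^{-1}\lambda^{i-M-2} + \bar\xi^b_M(i))
\end{aligned} \tag{40.66}$$

as well as (cf. (40.30)):

$$w_{M+1,i} = \begin{bmatrix} w_{M,i} \\ 0 \end{bmatrix} + \kappa_M(i) \begin{bmatrix} -w^b_{M,i} \\ 1 \end{bmatrix} \tag{40.67}$$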


We still need to show how to update the factors $\{\rho_M(i), \delta_M(i)\}$ in (40.65) in order
to arrive at an efficient recursive scheme. The derivation in the next section shows that
time-updates for $\{\rho_M(i), \delta_M(i)\}$ are possible regardless of data structure. This is a useful
observation; for example, it allows one to extend least-squares lattice algorithms to more
general filter structures (other than tapped-delay-line structures); see, e.g., Chapter 16 of
Sayed (2003) on Laguerre lattice filters.


Time-Update Relations 
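Consider first the quantity $\delta_M(i) = x_{0,i}^* \Lambda_i \bar b_{M,i}$, which appears in the numerator of $\kappa^b_M(i)$
in (40.64), and introduce the data matrix


$$H_{M+2,i} = [\, x_{0,i} \;\; \bar H_{M,i} \;\; x_{M+1,i} \,]$$

We partition it as


$$H_{M+2,i} = \begin{bmatrix} x_{0,i-1} & \bar H_{M,i-1} & x_{M+1,i-1} \\ u(i,0) & \bar u_{M,i} & u(i,M+1) \end{bmatrix}$$


where we are denoting the last entries of $\{x_{0,i}, x_{M+1,i}\}$ by $\{u(i,0), u(i,M+1)\}$, and the
last row of $\bar H_{M,i}$ by $\bar u_{M,i}$. Consider further


$$\delta_M(i-1) = x_{0,i-1}^* \Lambda_{i-1} \bar b_{M,i-1}$$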


Now recall that $\bar b_{M,i}$ is the residual vector that results from projecting $x_{M+1,i}$ onto $\mathcal{R}(\bar H_{M,i})$
with regularization matrix $\lambda^{i+1}\bar\Pi_M$. Likewise, $\bar b_{M,i-1}$ is the residual vector that results from
projecting $x_{M+1,i-1}$ onto $\mathcal{R}(\bar H_{M,i-1})$ with regularization matrix $\lambda^{i}\bar\Pi_M$. We are therefore
faced with the problem of time-updating the inner product $\delta_M(i)$, which is of the same
form as the problem studied in Sec. 32.3. More specifically, comparing with the statement
of Lemma 32.3 (or with the data matrix (32.48) and its time-updated version (32.50)), we
see that we can make the identifications:


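$$\begin{array}{lll}
H_{i-1} \leftarrow \bar H_{M,i-1} & \gamma(i) \leftarrow \bar\gamma_M(i) & \beta(i) \leftarrow u(i,M+1) \\
x_{i-1} \leftarrow x_{0,i-1} & h_i \leftarrow \bar u_{M,i} & \tilde\alpha(i) \leftarrow f_M(i) \\
\alpha(i) \leftarrow u(i,0) & z_{i-1} \leftarrow x_{M+1,i-1} & \tilde\beta(i) \leftarrow \bar b_M(i)
\end{array}$$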


Footnote: This last recursion (40.67) is used in Prob. X.9 to establish a relation between the standard RLS solution (40.12) and
lattice filters.


and arrive at the time-update relation (cf. (32.56)): 


$$\delta_M(i) = \lambda\, \delta_M(i-1) + \frac{f_M^*(i)\, \bar b_M(i)}{\bar\gamma_M(i)} \tag{40.68}$$


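Consider now the inner product $\rho_M(i) = y_i^* \Lambda_i b_{M,i}$, which appears in the numerator of
$\kappa_M(i)$ in (40.64), and introduce the matrix


$$[\, y_i \;\; H_{M,i} \;\; x_{M,i} \,]$$


Let us partition it as


$$\begin{bmatrix} y_{i-1} & H_{M,i-1} & x_{M,i-1} \\ d(i) & u_{M,i} & u(i,M) \end{bmatrix}$$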


Now recall that $b_{M,i}$ is the residual vector that results from projecting $x_{M,i}$ onto $\mathcal{R}(H_{M,i})$
with regularization matrix $\lambda^{i+1}\Pi_M$, while $b_{M,i-1}$ is the residual vector that results from
projecting $x_{M,i-1}$ onto $\mathcal{R}(H_{M,i-1})$ with regularization matrix $\lambda^{i}\Pi_M$. We are therefore
faced with the problem of time-updating the inner product $\rho_M(i)$, which is again of the
same form as the problem studied earlier in Sec. 32.3. More specifically, comparing with
the statement of Lemma 32.3 (or with the data matrix (32.48) and its time-updated version
(32.50)), we see that we can make the identifications:


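$$\begin{array}{lll}
H_{i-1} \leftarrow H_{M,i-1} & \gamma(i) \leftarrow \gamma_M(i) & \beta(i) \leftarrow u(i,M) \\
x_{i-1} \leftarrow y_{i-1} & h_i \leftarrow u_{M,i} & \tilde\alpha(i) \leftarrow r_M(i) \\
\alpha(i) \leftarrow d(i) & z_{i-1} \leftarrow x_{M,i-1} & \tilde\beta(i) \leftarrow b_M(i)
\end{array}$$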


and arrive at the time-update relation (cf. (32.56)): 


$$\rho_M(i) = \lambda\, \rho_M(i-1) + \frac{r_M^*(i)\, b_M(i)}{\gamma_M(i)} \tag{40.69}$$


We can also obtain time-updates for the minimum costs $\{\xi^f_M(i), \xi^b_M(i), \bar\xi^b_M(i)\}$ in much
the same manner as above. Alternatively, since these variables correspond to the minimum
costs of regularized least-squares problems, and since we already know how to time-update
such minimum costs (cf. Alg. 30.2), we can readily write


$$\begin{aligned}
\xi^f_M(i) &= \lambda\, \xi^f_M(i-1) + |f_M(i)|^2 / \bar\gamma_M(i) \\
\xi^b_M(i) &= \lambda\, \xi^b_M(i-1) + |b_M(i)|^2 / \gamma_M(i) \\
\bar\xi^b_M(i) &= \lambda\, \bar\xi^b_M(i-1) + |\bar b_M(i)|^2 / \bar\gamma_M(i)
\end{aligned} \tag{40.70}$$


Table 40.1 collects the various order- and time-update relations derived so far in the chapter. 
We again emphasize that these relations are independent of any data structure. 

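For convenience of notation, and also in order to save on addition operations, we introduce
the modified cost variables:


$$\begin{aligned}
\zeta^f_M(i) &\triangleq \eta^{-1}\lambda^{i-1} + \xi^f_M(i) \\
\zeta^b_M(i) &\triangleq \eta^{-1}\lambda^{i-M-1} + \xi^b_M(i) \\
\bar\zeta^b_M(i) &\triangleq \eta^{-1}\lambda^{i-M-2} + \bar\xi^b_M(i)
\end{aligned} \tag{40.71}$$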
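TABLE 40.1 A listing of the time- and order-update relations derived in Secs. 40.2–40.5. All these
updates are independent of data structure.


$\zeta^f_M(i) = \lambda\, \zeta^f_M(i-1) + |f_M(i)|^2 / \bar\gamma_M(i)$
$\zeta^b_M(i) = \lambda\, \zeta^b_M(i-1) + |b_M(i)|^2 / \gamma_M(i)$
$\bar\zeta^b_M(i) = \lambda\, \bar\zeta^b_M(i-1) + |\bar b_M(i)|^2 / \bar\gamma_M(i)$


$\xi_{M+1}(i) = \xi_M(i) - |\rho_M(i)|^2 / \zeta^b_M(i)$
$\zeta^b_{M+1}(i) = \bar\zeta^b_M(i) - |\delta_M(i)|^2 / \zeta^f_M(i)$
$\zeta^f_{M+1}(i) = \zeta^f_M(i) - |\delta_M(i)|^2 / \bar\zeta^b_M(i)$


$\rho_M(i) = \lambda\, \rho_M(i-1) + r_M^*(i)\, b_M(i) / \gamma_M(i)$
$\delta_M(i) = \lambda\, \delta_M(i-1) + f_M^*(i)\, \bar b_M(i) / \bar\gamma_M(i)$


$\kappa_M(i) = \rho_M^*(i) / \zeta^b_M(i)$
$\kappa^b_M(i) = \delta_M(i) / \zeta^f_M(i)$
$\kappa^f_M(i) = \delta_M^*(i) / \bar\zeta^b_M(i)$


$r_{M+1}(i) = r_M(i) - \kappa_M(i)\, b_M(i)$
$b_{M+1}(i) = \bar b_M(i) - \kappa^b_M(i)\, f_M(i)$
$f_{M+1}(i) = f_M(i) - \kappa^f_M(i)\, \bar b_M(i)$


$\gamma_{M+1}(i) = \gamma_M(i) - |b_M(i)|^2 / \zeta^b_M(i)$
$\gamma_{M+1}(i) = \bar\gamma_M(i) - |f_M(i)|^2 / \zeta^f_M(i)$
$\bar\gamma_{M+1}(i) = \bar\gamma_M(i) - |\bar b_M(i)|^2 / \bar\zeta^b_M(i)$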


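It is easy to verify from the time and order-updates for $\{\xi^f_M(i), \xi^b_M(i), \bar\xi^b_M(i)\}$ that these
modified variables satisfy similar updates, namely


$$\begin{aligned}
\zeta^f_M(i) &= \lambda\, \zeta^f_M(i-1) + |f_M(i)|^2 / \bar\gamma_M(i) \\
\zeta^b_M(i) &= \lambda\, \zeta^b_M(i-1) + |b_M(i)|^2 / \gamma_M(i) \\
\bar\zeta^b_M(i) &= \lambda\, \bar\zeta^b_M(i-1) + |\bar b_M(i)|^2 / \bar\gamma_M(i) \\
\xi_{M+1}(i) &= \xi_M(i) - |\rho_M(i)|^2 / \zeta^b_M(i) \\
\zeta^b_{M+1}(i) &= \bar\zeta^b_M(i) - |\delta_M(i)|^2 / \zeta^f_M(i) \\
\zeta^f_{M+1}(i) &= \zeta^f_M(i) - |\delta_M(i)|^2 / \bar\zeta^b_M(i)
\end{aligned} \tag{40.72}$$


albeit with initial conditions

$$\zeta^f_M(-1) = \eta^{-1}\lambda^{-2}, \qquad \zeta^b_M(-1) = \eta^{-1}\lambda^{-M-2}, \qquad \bar\zeta^b_M(-1) = \eta^{-1}\lambda^{-M-3} \tag{40.73}$$

while the initial values for the original variables are


$$\xi^f_M(-1) = \xi^b_M(-1) = \bar\xi^b_M(-1) = 0 \tag{40.74}$$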
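As a quick numerical illustration of the statement above, the following sketch runs the time-update of the original forward cost (started from zero, as in (40.74)) next to the time-update of the modified forward cost (started from $\eta^{-1}\lambda^{-2}$, as in (40.73)) and records the pair $(\zeta^f_M(i),\ \eta^{-1}\lambda^{i-1} + \xi^f_M(i))$ at every step. The driving terms $|f_M(i)|^2/\bar\gamma_M(i)$ are made-up numbers, and the variable names are chosen here purely for illustration.

```python
# Minimal sketch: the modified cost zeta^f obeys the same time-update as xi^f,
# differing only in its initial condition.
lam, eta = 0.97, 5.0

# Hypothetical values of |f_M(i)|^2 / gamma_bar_M(i) for i = 0, 1, 2, ...
driving_terms = [0.8, 0.1, 1.7, 0.4, 0.9]

xi_f = 0.0                        # xi^f_M(-1) = 0
zeta_f = 1.0 / (eta * lam ** 2)   # zeta^f_M(-1) = eta^{-1} lam^{-2}

pairs = []
for i, term in enumerate(driving_terms):
    xi_f = lam * xi_f + term          # time-update of the original cost
    zeta_f = lam * zeta_f + term      # same update applied to the modified cost
    # By definition, zeta^f_M(i) = eta^{-1} lam^{i-1} + xi^f_M(i)
    pairs.append((zeta_f, lam ** (i - 1) / eta + xi_f))

for zeta_value, definition_value in pairs:
    print(zeta_value, definition_value)
```

The two columns agree up to rounding, which is all that is being claimed: the regularization term $\eta^{-1}\lambda^{i-1}$ is absorbed into the initial condition, saving one addition per update.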


CHAPTER 41


Lattice Filter Algorithms 


Figure 41.1 illustrates how the error variables $\{r_M(i), b_M(i), f_M(i)\}$ are related in terms
of the reflection coefficients $\{\kappa_M(i), \kappa^b_M(i), \kappa^f_M(i)\}$, as was described in Chapter 40. It
should be noted that the recursions listed in Table 40.1 help characterize almost fully the
operation of the structure shown in the figure. The only missing piece of information is
to know how to update the error sequence $\{\bar b_M(i)\}$. This fact is indicated schematically
in Fig. 41.1 by the boxes with question marks. It is the update of these variables that
is determined by data structure, and figuring out their update is the key to achieving an
efficient algorithm; by efficient we mean $O(M)$ operations per iteration for a filter of order
$M$.

FIGURE 41.1 Relations among the residuals $\{r_M(i), f_M(i), b_M(i)\}$. The boxes with question
marks indicate that we still need to develop a relation between $\{b_M(i), \bar b_M(i)\}$. This relation turns
out to be a function of data structure. For example, if the regressors have shift structure, then the
question marks will be replaced by pure delays, $z^{-1}$. Other data structures would lead to other
choices for the blocks with question marks; see the derivation of Laguerre lattice filters in Chapter
16 of Sayed (2003).


41.1 SIGNIFICANCE OF DATA STRUCTURE 


To illustrate how the evaluation of $\bar b_M(i)$ is dependent on data structure, we now focus
on the case of regressors with shift structure, i.e., we assume that the entries of $u_{M,i}$ are
delayed versions of an input sequence $\{u(\cdot)\}$ so that


$$u_{M,i} = [\, u(i) \;\; u(i-1) \;\; \ldots \;\; u(i-M+1) \,], \qquad u_{M,i+1} = [\, u(i+1) \;\; u(i) \;\; \ldots \;\; u(i-M+2) \,]$$


Comparing the expressions for both $u_{M,i}$ and $u_{M,i+1}$ we see that $u_{M,i+1}$ is obtained from
$u_{M,i}$ by shifting the entries of the latter by one position to the right and introducing a new
entry, $u(i+1)$, at the left. In this way, the data matrix $H_{M,i}$ will also exhibit structure,
e.g., for $M = 4$, it will have the form (compare with (40.18)):


$$H_{4,i} = \begin{bmatrix}
u(0) & 0 & 0 & 0 \\
u(1) & u(0) & 0 & 0 \\
u(2) & u(1) & u(0) & 0 \\
u(3) & u(2) & u(1) & u(0) \\
\vdots & \vdots & \vdots & \vdots \\
u(i) & u(i-1) & u(i-2) & u(i-3)
\end{bmatrix}$$
where we are assuming $u(j) = 0$ for $j < 0$. Observe that every column of $H_{M,i}$ is a
shifted version of the previous column, i.e., every column is obtained from the previous
column by shifting its entries downwards by one position and by adding a zero entry. This
means that any two successive columns of $H_{M,i}$, say, $\{x_{j,i}, x_{j+1,i}\}$, are related by the
lower triangular shift matrix $Z$, i.e.,


$$x_{j+1,i} = Z x_{j,i} \tag{41.1}$$


where $Z$ is the $(i+1) \times (i+1)$ lower triangular matrix with zeros everywhere except for
unit entries on the first sub-diagonal, e.g., for $i = 3$,


$$Z = \begin{bmatrix} 0 & & & \\ 1 & 0 & & \\ & 1 & 0 & \\ & & 1 & 0 \end{bmatrix}$$

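Now, as in (40.31) and (40.45), we partition $H_{M+1,i}$ as


$$H_{M+1,i} = [\, x_{0,i} \;\; \bar H_{M,i} \,] = [\, H_{M,i} \;\; x_{M,i} \,]$$


to find that, in view of (41.1), the following relation holds between $\{\bar H_{M,i}, H_{M,i}\}$:


$$\bar H_{M,i} = Z H_{M,i} \tag{41.2}$$


In addition, the following relation also holds between $\{H_{M,i}, H_{M,i-1}\}$:


$$Z H_{M,i} = \begin{bmatrix} 0 \\ H_{M,i-1} \end{bmatrix} \tag{41.3}$$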
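The two shift relations can be checked on a small example. The sketch below builds the prewindowed data matrix from a short, arbitrary input sequence and verifies (41.2) and (41.3) entry by entry; the helper names `data_matrix` and `shift_down` are hypothetical and introduced only for this illustration.

```python
def data_matrix(u, M, i):
    """Rows j = 0..i of H_{M,i}; row j is [u(j), u(j-1), ..., u(j-M+1)],
    with u(j) = 0 for j < 0 (prewindowing)."""
    return [[u[j - k] if j - k >= 0 else 0 for k in range(M)]
            for j in range(i + 1)]


def shift_down(H):
    """Apply the lower triangular shift Z: move every column down by one
    position, insert a zero row on top, and drop the last row."""
    return [[0] * len(H[0])] + [row[:] for row in H[:-1]]


u = [3, 1, 4, 1, 5]      # arbitrary input samples u(0), ..., u(4)
M, i = 3, 4

H = data_matrix(u, M, i)                 # H_{M,i}
H_wide = data_matrix(u, M + 1, i)        # H_{M+1,i} = [x_{0,i}  H_bar_{M,i}]
H_bar = [row[1:] for row in H_wide]      # drop the leading column x_{0,i}

relation_41_2 = (H_bar == shift_down(H))                              # H_bar = Z H
relation_41_3 = (shift_down(H) == [[0] * M] + data_matrix(u, M, i - 1))
print(relation_41_2, relation_41_3)
```

Integer samples are used so that the comparisons are exact; the same check goes through for any sequence as long as the data are prewindowed.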


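With these relations we can now relate the residual vectors $\bar b_{M,i}$ and $b_{M,i}$. Thus, recall
their definitions:


$$\bar b_{M,i} = x_{M+1,i} - \bar H_{M,i} \bar w^b_{M,i}, \qquad b_{M,i} = x_{M,i} - H_{M,i} w^b_{M,i} \tag{41.4}$$


where


$$\begin{aligned}
\bar w^b_{M,i} &= \bar P_{M,i} \bar H_{M,i}^* \Lambda_i x_{M+1,i}, & \bar P_{M,i} &= (\lambda^{i+1}\bar\Pi_M + \bar H_{M,i}^* \Lambda_i \bar H_{M,i})^{-1} \\
w^b_{M,i} &= P_{M,i} H_{M,i}^* \Lambda_i x_{M,i}, & P_{M,i} &= (\lambda^{i+1}\Pi_M + H_{M,i}^* \Lambda_i H_{M,i})^{-1}
\end{aligned} \tag{41.5}$$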
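Substituting (41.2) and (41.3) into the expression for $\bar P_{M,i}$ we obtain


$$\bar P_{M,i} = \left( \lambda^{i+1}\bar\Pi_M + [\, 0 \;\; H_{M,i-1}^* \,]\, \Lambda_i \begin{bmatrix} 0 \\ H_{M,i-1} \end{bmatrix} \right)^{-1} = (\lambda^{i}\Pi_M + H_{M,i-1}^* \Lambda_{i-1} H_{M,i-1})^{-1}$$


That is,

$$\bar P_{M,i} = P_{M,i-1}$$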


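Substituting this result, along with (41.1), into expression (41.5) for $\bar w^b_{M,i}$ we obtain


$$\begin{aligned}
\bar w^b_{M,i} &= P_{M,i-1} H_{M,i}^* Z^* \Lambda_i Z x_{M,i} \\
&= P_{M,i-1} H_{M,i-1}^* \Lambda_{i-1} x_{M,i-1} \\
&= w^b_{M,i-1}
\end{aligned}$$

so that

$$\begin{aligned}
\bar b_{M,i} &= x_{M+1,i} - \bar H_{M,i} \bar w^b_{M,i} \\
&= Z x_{M,i} - Z H_{M,i} w^b_{M,i-1} \\
&= \begin{bmatrix} 0 \\ x_{M,i-1} \end{bmatrix} - \begin{bmatrix} 0 \\ H_{M,i-1} \end{bmatrix} w^b_{M,i-1} \\
&= \begin{bmatrix} 0 \\ b_{M,i-1} \end{bmatrix}
\end{aligned}$$


and, consequently, by equating the last entries of both sides,


$$\bar b_M(i) = b_M(i-1) \tag{41.6}$$


In other words, we find that in the shift-structured case, the residual errors $\{\bar b_M(i)\}$ are
time-delayed versions of $\{b_M(i)\}$. In a similar vein, it can be verified that


$$\bar\gamma_M(i) = \gamma_M(i-1), \qquad \bar\zeta^b_M(i) = \zeta^b_M(i-1) \tag{41.7}$$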


Interpretation of the Estimation Errors 


In the shift-structured case, we can provide additional insights into the meaning of the a
posteriori estimation errors $\{f_M(i), b_M(i), r_M(i)\}$. Indeed, in this scenario, we have from
the definitions of these errors,


$$\begin{aligned}
f_M(i) &= u(i) - u_{M,i-1} w^f_{M,i} \\
b_M(i) &= u(i-M) - u_{M,i} w^b_{M,i} \\
r_M(i) &= d(i) - u_{M,i} w_{M,i}
\end{aligned}$$


where

$$u_{M,i} = [\, u(i) \;\; u(i-1) \;\; \ldots \;\; u(i-M+1) \,]$$

In this way, $f_M(i)$ can be interpreted as the forward prediction error in estimating $u(i)$
from the $M$ past values, while $b_M(i)$ can be interpreted as the backward prediction error
in estimating $u(i-M)$ from the $M$ future values.


41.2 A POSTERIORI-BASED LATTICE FILTER 


If we substitute the results (41.6)-(41.7) into Table 40.1, we arrive at Alg. 41.1. This filter
is known as the a posteriori lattice filter since it relies on the propagation of the a posteriori
errors $\{f_M(i), b_M(i), r_M(i)\}$; the filter is depicted in Fig. 41.2.


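Algorithm 41.1 (A posteriori lattice filter) Let $0 \ll \lambda \le 1$ be a forgetting factor
and define $\Pi_M = \eta^{-1}\,\mathrm{diag}\{\lambda^{-2}, \lambda^{-3}, \ldots, \lambda^{-(M+1)}\}$. Consider a reference
sequence $\{d(j)\}$ and a regressor sequence $\{u_{M,j}\}$ with shift structure and
of dimension $1 \times M$, say, $u_{M,j} = [\, u(j) \;\; u(j-1) \;\; \ldots \;\; u(j-M+1) \,]$.
For each $i \ge 0$, the $M$-th order a posteriori estimation error, $r_M(i) =
d(i) - u_{M,i} w_{M,i}$, that results from the solution of the regularized least-squares
problem:


$$\min_{w_M} \left[ \lambda^{i+1} w_M^* \Pi_M w_M + \sum_{j=0}^{i} \lambda^{i-j} |d(j) - u_{M,j} w_M|^2 \right]$$


can be computed as follows:


1. Initialization. From $m = 0$ to $m = M - 1$ set:


$\delta_m(-1) = \rho_m(-1) = 0$, $\gamma_m(-1) = 1$, $b_m(-1) = 0$
$\zeta^f_m(-1) = \eta^{-1}\lambda^{-2}$, $\zeta^b_m(-1) = \eta^{-1}\lambda^{-m-2}$


2. For $i \ge 0$, repeat:


• Set $\gamma_0(i) = 1$, $b_0(i) = f_0(i) = u(i)$, and $r_0(i) = d(i)$
• From $m = 0$ to $m = M - 1$, repeat:


$\zeta^f_m(i) = \lambda\, \zeta^f_m(i-1) + |f_m(i)|^2 / \gamma_m(i-1)$
$\zeta^b_m(i) = \lambda\, \zeta^b_m(i-1) + |b_m(i)|^2 / \gamma_m(i)$
$\delta_m(i) = \lambda\, \delta_m(i-1) + f_m^*(i)\, b_m(i-1) / \gamma_m(i-1)$
$\rho_m(i) = \lambda\, \rho_m(i-1) + r_m^*(i)\, b_m(i) / \gamma_m(i)$
$\gamma_{m+1}(i) = \gamma_m(i) - |b_m(i)|^2 / \zeta^b_m(i)$
$\kappa^b_m(i) = \delta_m(i) / \zeta^f_m(i)$
$\kappa^f_m(i) = \delta_m^*(i) / \zeta^b_m(i-1)$
$\kappa_m(i) = \rho_m^*(i) / \zeta^b_m(i)$
$b_{m+1}(i) = b_m(i-1) - \kappa^b_m(i)\, f_m(i)$
$f_{m+1}(i) = f_m(i) - \kappa^f_m(i)\, b_m(i-1)$
$r_{m+1}(i) = r_m(i) - \kappa_m(i)\, b_m(i)$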


FIGURE 41.2 The a posteriori-based lattice filter. 
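The recursions of Alg. 41.1 can be exercised on a short data record. The following is a minimal sketch, assuming real-valued data (so all conjugations drop out) and prewindowing; the function names, the test data, and the values of $\lambda$ and $\eta$ are chosen here only for illustration. It runs the lattice recursions and, as a sanity check, compares the resulting $r_M(i)$ with the residual obtained by solving the regularized least-squares problem directly at every time instant.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    A = [row[:] + [b[k]] for k, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(c + 1, n):
            t = A[r][c] / A[c][c]
            for k in range(c, n + 1):
                A[r][k] -= t * A[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (A[r][n] - sum(A[r][k] * x[k] for k in range(r + 1, n))) / A[r][r]
    return x


def regressor(u, j, M):
    """u_{M,j} = [u(j), u(j-1), ..., u(j-M+1)] with u(k) = 0 for k < 0."""
    return [u[j - k] if j - k >= 0 else 0.0 for k in range(M)]


def direct_residual(u, d, M, i, lam, eta):
    """r_M(i) obtained by solving the regularized problem at time i directly."""
    A = [[0.0] * M for _ in range(M)]
    rhs = [0.0] * M
    for k in range(M):                      # lam^(i+1) * Pi_M on the diagonal
        A[k][k] = lam ** (i + 1) / (eta * lam ** (k + 2))
    for j in range(i + 1):
        h, wgt = regressor(u, j, M), lam ** (i - j)
        for k in range(M):
            rhs[k] += wgt * h[k] * d[j]
            for l in range(M):
                A[k][l] += wgt * h[k] * h[l]
    w = solve(A, rhs)
    return d[i] - sum(hk * wk for hk, wk in zip(regressor(u, i, M), w))


def lattice_a_posteriori(u, d, M, lam, eta):
    """A posteriori lattice recursions (real data); returns r_M(i) for each i."""
    zf = [1.0 / (eta * lam ** 2)] * M                      # zeta^f_m(-1)
    zb = [1.0 / (eta * lam ** (m + 2)) for m in range(M)]  # zeta^b_m(-1)
    delta, rho = [0.0] * M, [0.0] * M
    b_prev, g_prev = [0.0] * M, [1.0] * M    # b_m(i-1), gamma_m(i-1)
    out = []
    for ui, di in zip(u, d):
        g, b, f, r = 1.0, ui, ui, di         # order-0 quantities at time i
        b_now, g_now = [0.0] * M, [0.0] * M
        for m in range(M):
            b_now[m], g_now[m] = b, g
            zb_old = zb[m]                   # zeta^b_m(i-1)
            zf[m] = lam * zf[m] + f * f / g_prev[m]
            zb[m] = lam * zb[m] + b * b / g
            delta[m] = lam * delta[m] + f * b_prev[m] / g_prev[m]
            rho[m] = lam * rho[m] + r * b / g
            kb, kf, k = delta[m] / zf[m], delta[m] / zb_old, rho[m] / zb[m]
            g, b, f, r = (g - b * b / zb[m], b_prev[m] - kb * f,
                          f - kf * b_prev[m], r - k * b)
        b_prev, g_prev = b_now, g_now
        out.append(r)
    return out


u = [1.0, -0.5, 2.0, 0.3, -1.2, 0.7, 1.5, -0.8]
d = [0.5, 1.0, -1.0, 2.0, 0.1, -0.4, 0.9, 1.3]
r_lattice = lattice_a_posteriori(u, d, 3, 0.95, 10.0)
r_direct = [direct_residual(u, d, 3, i, 0.95, 10.0) for i in range(len(u))]
print(max(abs(a - b) for a, b in zip(r_lattice, r_direct)))
```

The direct solver costs $O(M^3)$ per time instant and is included only as a reference; the lattice loop itself performs $O(M)$ operations per iteration, which is the point of the construction.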


Remark 41.1 (Data structure) What if two successive columns of the data matrix $H_{M,i}$ are not
shifted versions of each other as in (41.1), but are instead related by some other matrix $\Phi$? Would it
still be possible to derive a lattice algorithm? Interestingly enough, the answer is positive; see, e.g.,
Chapter 16 of Sayed (2003), which deals with Laguerre lattice filters.


◇


41.3 A PRIORI-BASED LATTICE FILTER 


Algorithm 41.1 relies on order-updates for the a posteriori estimation errors denoted by
$\{r_M(i), f_M(i), b_M(i)\}$. A similar algorithm can be obtained by relying instead on a priori
errors, which are defined as follows.

Refer again to Sec. 40.5 and to the data matrix $H_{M+1,i}$ with its equivalent partitionings:


$$H_{M+1,i} = [\, x_{0,i} \;\; x_{1,i} \;\; \ldots \;\; x_{M,i} \,] = [\, H_{M,i} \;\; x_{M,i} \,] = [\, x_{0,i} \;\; \bar H_{M,i} \,] = [\, x_{0,i} \;\; \bar H_{M-1,i} \;\; x_{M,i} \,]$$


We define the a priori residual vectors as follows:


$e_{M,i}$ = a priori residual from projecting $y_i$ onto $H_{M,i}$
$\beta_{M,i}$ = a priori residual from projecting $x_{M,i}$ onto $H_{M,i}$
$\alpha_{M,i}$ = a priori residual from projecting $x_{0,i}$ onto $\bar H_{M,i}$
$\bar\beta_{M,i}$ = a priori residual from projecting $x_{M+1,i}$ onto $\bar H_{M,i}$


where the projection problems for $\{e_{M,i}, \beta_{M,i}\}$ employ the regularization matrix $\lambda^{i}\Pi_M$,
while the projection problems for $\{\alpha_{M,i}, \bar\beta_{M,i}\}$ employ the regularization matrix $\lambda^{i}\bar\Pi_M$.
The term a priori in the above definitions means that prior weight estimates are used in the
definition of the errors. More specifically (compare with (40.59)),


$$\begin{aligned}
e_{M,i} &= y_i - H_{M,i} w_{M,i-1}, & \alpha_{M,i} &= x_{0,i} - \bar H_{M,i} w^f_{M,i-1} \\
\beta_{M,i} &= x_{M,i} - H_{M,i} w^b_{M,i-1}, & \bar\beta_{M,i} &= x_{M+1,i} - \bar H_{M,i} \bar w^b_{M,i-1}
\end{aligned}$$
where now $w^f_{M,i-1}$, for example, is the solution to a regularized least-squares problem of
the form


$$\min_{w^f_M} \left[ \lambda^{i} w^{f*}_M \bar\Pi_M w^f_M + (x_{0,i-1} - \bar H_{M,i-1} w^f_M)^* \Lambda_{i-1} (x_{0,i-1} - \bar H_{M,i-1} w^f_M) \right] \tag{41.8}$$

Comparing this cost function with (40.60) we see that $i$ is replaced by $i - 1$. The last
entries of the above residual vectors are denoted by $\{e_M(i), \alpha_M(i), \beta_M(i), \bar\beta_M(i)\}$ and
they are referred to as the a priori estimation errors. They are, of course, related to the
corresponding a posteriori errors (40.61) via the associated conversion factors:


$$\begin{aligned}
r_M(i) &= e_M(i)\, \gamma_M(i), & b_M(i) &= \beta_M(i)\, \gamma_M(i) \\
f_M(i) &= \alpha_M(i)\, \bar\gamma_M(i), & \bar b_M(i) &= \bar\beta_M(i)\, \bar\gamma_M(i)
\end{aligned}$$


By following arguments similar to what we did in Secs. 40.2-40.4, and which led to
(40.63), it can be verified that these a priori errors satisfy the following order-update
relations in terms of the same reflection coefficients $\{\kappa_M(i), \kappa^b_M(i), \kappa^f_M(i)\}$:


$$\begin{aligned}
e_{M+1}(i) &= e_M(i) - \kappa_M(i-1)\, \beta_M(i) \\
\beta_{M+1}(i) &= \bar\beta_M(i) - \kappa^b_M(i-1)\, \alpha_M(i) \\
\alpha_{M+1}(i) &= \alpha_M(i) - \kappa^f_M(i-1)\, \bar\beta_M(i)
\end{aligned} \tag{41.9}$$


Again, relations (41.9) hold irrespective of data structure. However, as was shown in
Sec. 41.1 for the variables $\{b_M(i), \bar b_M(i)\}$, when the regressors possess shift structure it
will also hold that

$$\bar\beta_M(i) = \beta_M(i-1)$$


In addition, the a priori estimation errors $\{\alpha_M(i), \beta_M(i), e_M(i)\}$ will admit the following
interpretations:


$$\alpha_M(i) = u(i) - u_{M,i-1} w^f_{M,i-1} \tag{41.10}$$
$$\beta_M(i) = u(i-M) - u_{M,i} w^b_{M,i-1} \tag{41.11}$$
$$e_M(i) = d(i) - u_{M,i} w_{M,i-1} \tag{41.12}$$


FIGURE 41.3 The a priori-based lattice filter.


In other words, $\alpha_M(i)$ denotes the forward prediction error in estimating $u(i)$ from the $M$
past values using the prior weight estimate, $w^f_{M,i-1}$, while $\beta_M(i)$ denotes the backward
prediction error in estimating $u(i-M)$ from the $M$ future values using the prior weight
estimate, $w^b_{M,i-1}$. The resulting a priori-based lattice filter is listed in Alg. 41.2 and shown
in Fig. 41.3.


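Algorithm 41.2 (A priori lattice filter) Consider the same setting of Alg. 41.1.
For each $i \ge 0$, the $M$-th order a priori estimation error, $e_M(i) = d(i) -
u_{M,i} w_{M,i-1}$, that results from the solution of the regularized least-squares
problem


$$\min_{w_M} \left[ \lambda^{i} w_M^* \Pi_M w_M + \sum_{j=0}^{i-1} \lambda^{i-1-j} |d(j) - u_{M,j} w_M|^2 \right]$$


can be computed as follows:


1. Initialization. From $m = 0$ to $m = M - 1$ set:

$\delta_m(-1) = \rho_m(-1) = 0$, $\gamma_m(-1) = 1$, $\beta_m(-1) = 0$
$\kappa^f_m(-1) = \kappa^b_m(-1) = \kappa_m(-1) = 0$
$\zeta^f_m(-1) = \eta^{-1}\lambda^{-2}$, $\zeta^b_m(-1) = \eta^{-1}\lambda^{-m-2}$


2. For $i \ge 0$, repeat:


• Set $\gamma_0(i) = 1$, $\beta_0(i) = \alpha_0(i) = u(i)$, and $e_0(i) = d(i)$
• From $m = 0$ to $m = M - 1$, repeat:


$\zeta^f_m(i) = \lambda\, \zeta^f_m(i-1) + |\alpha_m(i)|^2\, \gamma_m(i-1)$
$\zeta^b_m(i) = \lambda\, \zeta^b_m(i-1) + |\beta_m(i)|^2\, \gamma_m(i)$
$\delta_m(i) = \lambda\, \delta_m(i-1) + \alpha_m^*(i)\, \beta_m(i-1)\, \gamma_m(i-1)$
$\rho_m(i) = \lambda\, \rho_m(i-1) + e_m^*(i)\, \beta_m(i)\, \gamma_m(i)$
$\beta_{m+1}(i) = \beta_m(i-1) - \kappa^b_m(i-1)\, \alpha_m(i)$
$\alpha_{m+1}(i) = \alpha_m(i) - \kappa^f_m(i-1)\, \beta_m(i-1)$
$e_{m+1}(i) = e_m(i) - \kappa_m(i-1)\, \beta_m(i)$
$\gamma_{m+1}(i) = \gamma_m(i) - |\gamma_m(i)\, \beta_m(i)|^2 / \zeta^b_m(i)$
$\kappa^b_m(i) = \delta_m(i) / \zeta^f_m(i)$
$\kappa^f_m(i) = \delta_m^*(i) / \zeta^b_m(i-1)$
$\kappa_m(i) = \rho_m^*(i) / \zeta^b_m(i)$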
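The a priori recursions can be checked in the same spirit. The sketch below assumes real-valued, prewindowed data; the function names, data, and parameter values are illustrative choices, not part of the algorithm statement. It propagates the a priori errors through the lattice and compares $e_M(i)$ with $d(i) - u_{M,i} w_{M,i-1}$, where $w_{M,i-1}$ is obtained by solving the regularized problem at time $i-1$ directly.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    A = [row[:] + [b[k]] for k, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(c + 1, n):
            t = A[r][c] / A[c][c]
            for k in range(c, n + 1):
                A[r][k] -= t * A[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (A[r][n] - sum(A[r][k] * x[k] for k in range(r + 1, n))) / A[r][r]
    return x


def regressor(u, j, M):
    """u_{M,j} = [u(j), ..., u(j-M+1)] with u(k) = 0 for k < 0."""
    return [u[j - k] if j - k >= 0 else 0.0 for k in range(M)]


def direct_weights(u, d, M, i, lam, eta):
    """w_{M,i} from the regularized problem at time i (i = -1 gives zero)."""
    A = [[0.0] * M for _ in range(M)]
    rhs = [0.0] * M
    for k in range(M):
        A[k][k] = lam ** (i + 1) / (eta * lam ** (k + 2))
    for j in range(i + 1):
        h, wgt = regressor(u, j, M), lam ** (i - j)
        for k in range(M):
            rhs[k] += wgt * h[k] * d[j]
            for l in range(M):
                A[k][l] += wgt * h[k] * h[l]
    return solve(A, rhs)


def lattice_a_priori(u, d, M, lam, eta):
    """A priori lattice recursions (real data); returns e_M(i) for each i."""
    zf = [1.0 / (eta * lam ** 2)] * M
    zb = [1.0 / (eta * lam ** (m + 2)) for m in range(M)]
    delta, rho = [0.0] * M, [0.0] * M
    kf, kb, k = [0.0] * M, [0.0] * M, [0.0] * M   # coefficients at time i-1
    beta_prev, g_prev = [0.0] * M, [1.0] * M      # beta_m(i-1), gamma_m(i-1)
    out = []
    for ui, di in zip(u, d):
        g, beta, alpha, e = 1.0, ui, ui, di
        beta_now, g_now = [0.0] * M, [0.0] * M
        for m in range(M):
            beta_now[m], g_now[m] = beta, g
            zb_old = zb[m]                        # zeta^b_m(i-1)
            zf[m] = lam * zf[m] + alpha * alpha * g_prev[m]
            zb[m] = lam * zb[m] + beta * beta * g
            delta[m] = lam * delta[m] + alpha * beta_prev[m] * g_prev[m]
            rho[m] = lam * rho[m] + e * beta * g
            # Order updates use the reflection coefficients from time i-1.
            nxt = (g - (g * beta) ** 2 / zb[m], beta_prev[m] - kb[m] * alpha,
                   alpha - kf[m] * beta_prev[m], e - k[m] * beta)
            kb[m], kf[m], k[m] = delta[m] / zf[m], delta[m] / zb_old, rho[m] / zb[m]
            g, beta, alpha, e = nxt
        beta_prev, g_prev = beta_now, g_now
        out.append(e)
    return out


u = [1.0, -0.5, 2.0, 0.3, -1.2, 0.7, 1.5, -0.8]
d = [0.5, 1.0, -1.0, 2.0, 0.1, -0.4, 0.9, 1.3]
e_lattice = lattice_a_priori(u, d, 3, 0.95, 10.0)
e_direct = [d[i] - sum(h * w for h, w in zip(regressor(u, i, 3),
            direct_weights(u, d, 3, i - 1, 0.95, 10.0))) for i in range(len(u))]
print(max(abs(a - b) for a, b in zip(e_lattice, e_direct)))
```

Note that the order updates inside the loop are evaluated before the reflection coefficients are refreshed, mirroring the use of $\kappa_m(i-1)$, $\kappa^b_m(i-1)$, and $\kappa^f_m(i-1)$ in the listing.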
Error-Feedback Lattice Filters


Although the a posteriori- and a priori-based lattice filters of the previous chapter
are common lattice forms, several other equivalent implementations exist such as
error-feedback forms, array-based forms, and normalized forms. All these variants are, of
course, theoretically equivalent. However, they can differ in performance under different
operating conditions that arise, for example, in finite-precision implementations or as a
result of noise and regularization.


42.1 A PRIORI ERROR-FEEDBACK LATTICE FILTER 


In this section we derive the so-called error-feedback form, which tends to exhibit good
performance under finite-precision conditions. The algorithm can be motivated as follows.
Observe from the a posteriori lattice recursions of Alg. 41.1 that a pivotal role is played
by the reflection coefficients $\{\kappa_M(i), \kappa^b_M(i), \kappa^f_M(i)\}$. These coefficients are computed as
ratios of certain quantities. For example, $\kappa^f_M(i)$ is computed as the ratio $\delta_M^*(i)/\zeta^b_M(i-1)$,
with separate recursions used to evaluate the numerator and denominator quantities.
Likewise for $\{\kappa^b_M(i), \kappa_M(i)\}$.
The error-feedback lattice form replaces the separate recursions for the numerator and 
denominator quantities by direct recursions for the reflection coefficients themselves. In 
principle, we could derive these recursions algebraically as follows. Consider for instance 


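$$\kappa_M(i) = \rho_M^*(i) / \zeta^b_M(i)$$


where, from the listing of the a posteriori-based lattice filter, the numerator and
denominator quantities are updated via


$$\begin{aligned}
\rho_M(i) &= \lambda\, \rho_M(i-1) + r_M^*(i)\, b_M(i) / \gamma_M(i) \\
\zeta^b_M(i) &= \lambda\, \zeta^b_M(i-1) + |b_M(i)|^2 / \gamma_M(i)
\end{aligned} \tag{42.1}$$

Substituting these expressions into the above expression for $\kappa_M(i)$ leads to


$$\kappa_M(i) = \frac{\lambda\, \rho_M^*(i-1) + r_M(i)\, b_M^*(i) / \gamma_M(i)}{\lambda\, \zeta^b_M(i-1) + |b_M(i)|^2 / \gamma_M(i)}$$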


and some algebra will then result in a relation between $\kappa_M(i)$ and $\kappa_M(i-1)$. We shall
not pursue this algebraic route. Instead, we shall employ a more elegant geometric
argument and use it to provide an interesting interpretation for the reflection coefficients
$\{\kappa_M(i), \kappa^b_M(i), \kappa^f_M(i)\}$. Specifically, we shall show that each one of them can be
interpreted as performing the projection of a normalized error vector onto another normalized
error vector. This interpretation allows us to invoke many of the results we have already
developed for projection (least-squares) problems and to arrive at recursions for the reflection
coefficients almost by inspection.

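We start with the reflection coefficient


$$\kappa_M(i) = \rho_M^*(i) / \zeta^b_M(i) \tag{42.2}$$


where from Table 40.1, the numerator and denominator quantities satisfy the time-updates
(42.1). Now define the so-called angle normalized estimation errors:


$$b'_M(i) \triangleq b_M(i) / \gamma_M^{1/2}(i) = \beta_M(i)\, \gamma_M^{1/2}(i), \qquad r'_M(i) \triangleq r_M(i) / \gamma_M^{1/2}(i) = e_M(i)\, \gamma_M^{1/2}(i) \tag{42.3}$$


The reason for the qualification “angle-normalized” is that, since the conversion factor
satisfies $0 < \gamma_M(i) \le 1$, its square-root can be interpreted as the cosine of an angle, say,
$\gamma_M^{1/2}(i) = \cos \phi_M(i)$ for some $\phi_M(i)$.
Using these normalized variables, we can rewrite recursions (42.1) as


$$\begin{aligned}
\rho_M(i) &= \lambda\, \rho_M(i-1) + r'^*_M(i)\, b'_M(i) \\
\zeta^b_M(i) &= \lambda\, \zeta^b_M(i-1) + |b'_M(i)|^2
\end{aligned} \tag{42.4}$$

which, in view of the initial conditions $\rho_M(-1) = 0$ and $\xi^b_M(-1) = 0$, lead to

$$\rho_M(i) = \sum_{j=0}^{i} \lambda^{i-j}\, r'^*_M(j)\, b'_M(j), \qquad \xi^b_M(i) = \sum_{j=0}^{i} \lambda^{i-j}\, |b'_M(j)|^2$$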
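In other words, if we introduce the following vectors of angle normalized errors:

$$b'_{M,i} \triangleq \begin{bmatrix} b'_M(0) \\ b'_M(1) \\ \vdots \\ b'_M(i) \end{bmatrix}, \qquad r'_{M,i} \triangleq \begin{bmatrix} r'_M(0) \\ r'_M(1) \\ \vdots \\ r'_M(i) \end{bmatrix}$$

then the quantities $\{\rho_M(i), \xi^b_M(i)\}$ can be recognized as the inner products

$$\rho_M(i) = r'^*_{M,i} \Lambda_i b'_{M,i}, \qquad \xi^b_M(i) = b'^*_{M,i} \Lambda_i b'_{M,i} \tag{42.5}$$


where

$$\Lambda_i = \mathrm{diag}\{\lambda^{i}, \lambda^{i-1}, \ldots, \lambda, 1\}$$


Expression (42.5) should be compared with the original definitions (40.65) for these
variables, namely,


$$\rho_M(i) = y_i^* \Lambda_i b_{M,i} \quad \text{and} \quad \xi^b_M(i) = x_{M,i}^* \Lambda_i b_{M,i}$$


Footnote: The vectors $b_{M,i}$ and $b'_{M,i}$ differ in a fundamental way; the entries of $b_{M,i}$ are not
$\{b_M(0), b_M(1), \ldots, b_M(i)\}$! Likewise for $r_{M,i}$ and $r'_{M,i}$.

The representations (42.5) allow us to re-express the defining relation (42.2) for $\kappa_M(i)$ in the form:


$$\kappa_M(i) = (\bar\eta^{-1}\lambda^{i+1} + b'^*_{M,i} \Lambda_i b'_{M,i})^{-1}\, b'^*_{M,i} \Lambda_i r'_{M,i} \tag{42.6}$$

where $\bar\eta = \eta \lambda^{M+2}$. This expression for $\kappa_M(i)$ has the form of the solution to a regularized
least-squares problem! In other words, $\kappa_M(i)$ can be regarded as the solution to the problem
that projects (in a regularized manner) the vector $r'_{M,i}$ onto the vector $b'_{M,i}$, namely, it
solves


$$\min_{\kappa_M} \left[ \bar\eta^{-1}\lambda^{i+1} |\kappa_M|^2 + \| r'_{M,i} - \kappa_M b'_{M,i} \|^2_{\Lambda_i} \right] \Longrightarrow \kappa_M(i) \tag{42.7}$$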


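This observation readily establishes that $\kappa_M(i)$ can be time-updated via a standard RLS
update of the form:

$$\begin{aligned}
\kappa_M(i) &= \kappa_M(i-1) + \frac{b'^*_M(i)}{\zeta^b_M(i)} \left[ r'_M(i) - b'_M(i)\, \kappa_M(i-1) \right]
\end{aligned} \tag{42.8}$$

$$\begin{aligned}
\phantom{\kappa_M(i)} &= \kappa_M(i-1) + \frac{\beta_M^*(i)\, \gamma_M(i)}{\zeta^b_M(i)} \left[ e_M(i) - \beta_M(i)\, \kappa_M(i-1) \right] \\
&= \kappa_M(i-1) + \frac{\beta_M^*(i)\, \gamma_M(i)\, e_{M+1}(i)}{\zeta^b_M(i)}
\end{aligned} \tag{42.9}$$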
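The equivalence between the ratio form (42.2) and the direct recursion (42.8) is easy to confirm numerically. The sketch below assumes real-valued angle-normalized errors; the sequences `b_prime` and `r_prime` and the initial value of $\zeta^b_M(-1)$ are arbitrary numbers picked for the illustration. It propagates $\rho_M(i)$ and $\zeta^b_M(i)$ through (42.4), forms their ratio, and compares it with the value produced by updating $\kappa_M(i)$ directly.

```python
lam = 0.98

# Hypothetical angle-normalized errors b'_M(i) and r'_M(i), i = 0, 1, 2, ...
b_prime = [0.4, -1.1, 0.7, 0.2, -0.5]
r_prime = [1.0, 0.3, -0.6, 0.9, 0.1]

rho = 0.0      # rho_M(-1) = 0
zeta_b = 0.1   # zeta^b_M(-1): any positive value (assumed here)
kappa = 0.0    # kappa_M(-1) = 0

history = []
for b, r in zip(b_prime, r_prime):
    # Separate recursions for numerator and denominator, as in (42.4)
    rho = lam * rho + r * b
    zeta_b = lam * zeta_b + b * b
    # Direct (error-feedback style) recursion for the coefficient, as in (42.8)
    kappa = kappa + (b / zeta_b) * (r - b * kappa)
    history.append((kappa, rho / zeta_b))

for direct_value, ratio_value in history:
    print(direct_value, ratio_value)
```

In exact arithmetic the two columns coincide; the motivation for the error-feedback form is that the two computations can behave differently under finite precision, which this floating-point sketch does not attempt to demonstrate.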


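Comparing the statement (42.7) with that of a generic regularized least-squares problem
(40.1), we see that $\kappa_M$ plays the role of $w_M$, $b'_{M,i}$ plays the role of $H_{M,i}$, $r'_{M,i}$ plays
the role of $y_i$, $\bar\eta^{-1}$ plays the role of $\Pi_M$, and $\zeta^b_M(i)$ plays the role of $P_{M,i}^{-1}$; this latter
conclusion is because $\zeta^b_M(i)$ can be seen to be the coefficient of the normal equations
(42.6), namely, $\zeta^b_M(i)\, \kappa_M(i) = b'^*_{M,i} \Lambda_i r'_{M,i}$. Then (42.8) follows from the update for
$w_{M,i}$ in (40.12). Indeed, the last equation (42.9) is obtained from the order-update (41.9)
for $e_{M+1}(i)$. In a similar vein, we can derive updates for $\kappa^f_M(i)$ and $\kappa^b_M(i)$. Thus, define
the angle normalized estimation errors:


$$\bar b'_M(i) \triangleq \bar b_M(i) / \bar\gamma_M^{1/2}(i) = \bar\beta_M(i)\, \bar\gamma_M^{1/2}(i), \qquad f'_M(i) \triangleq f_M(i) / \bar\gamma_M^{1/2}(i) = \alpha_M(i)\, \bar\gamma_M^{1/2}(i) \tag{42.10}$$


and the corresponding vectors


$$\bar b'_{M,i} \triangleq \begin{bmatrix} \bar b'_M(0) \\ \bar b'_M(1) \\ \vdots \\ \bar b'_M(i) \end{bmatrix}, \qquad f'_{M,i} \triangleq \begin{bmatrix} f'_M(0) \\ f'_M(1) \\ \vdots \\ f'_M(i) \end{bmatrix}$$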


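Then the defining expressions for $\{\kappa^f_M(i), \kappa^b_M(i)\}$ from Table 40.1 can be written as


$$\kappa^f_M(i) = (\eta^{-1}\lambda^{i-M-2} + \bar b'^*_{M,i} \Lambda_i \bar b'_{M,i})^{-1}\, \bar b'^*_{M,i} \Lambda_i f'_{M,i}$$


$$\kappa^b_M(i) = (\eta^{-1}\lambda^{i-1} + f'^*_{M,i} \Lambda_i f'_{M,i})^{-1}\, f'^*_{M,i} \Lambda_i \bar b'_{M,i}$$


These expressions again suggest that we can interpret $\{\kappa^f_M(i), \kappa^b_M(i)\}$ as the solution of
the problems that project (in a regularized manner) the vectors $\{f'_{M,i}, \bar b'_{M,i}\}$ onto the
vectors $\{\bar b'_{M,i}, f'_{M,i}\}$, namely, they solve


$$\min_{\kappa^f_M} \left[ \tilde\eta^{-1}\lambda^{i+1} |\kappa^f_M|^2 + \| f'_{M,i} - \kappa^f_M \bar b'_{M,i} \|^2_{\Lambda_i} \right] \Longrightarrow \kappa^f_M(i)$$


$$\min_{\kappa^b_M} \left[ \eta^{-1}\lambda^{i-1} |\kappa^b_M|^2 + \| \bar b'_{M,i} - \kappa^b_M f'_{M,i} \|^2_{\Lambda_i} \right] \Longrightarrow \kappa^b_M(i)$$


where $\tilde\eta = \eta \lambda^{M+3}$. These observations therefore establish that $\{\kappa^f_M(i), \kappa^b_M(i)\}$ can also be
time-updated via standard RLS recursions as


$$\kappa^f_M(i) = \kappa^f_M(i-1) + \bar\beta_M^*(i)\, \bar\gamma_M(i)\, \alpha_{M+1}(i) / \bar\zeta^b_M(i) \tag{42.11}$$


$$\kappa^b_M(i) = \kappa^b_M(i-1) + \alpha_M^*(i)\, \bar\gamma_M(i)\, \beta_{M+1}(i) / \zeta^f_M(i) \tag{42.12}$$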


Table 42.1 collects the various order and time-update relations derived so far for the a
priori estimation errors, in addition to updates that are obtained immediately from those
of Table 40.1 by simply rewriting them in terms of a priori errors (like, e.g., the first three
relations in Table 42.1). The relations of Table 42.1 are again independent of any data
structure. The only update that is missing is one for the error sequence $\{\bar\beta_M(i)\}$.


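TABLE 42.1 A listing of time and order-update relations for the a priori estimation errors; these
updates are independent of data structure.


$\zeta^f_M(i) = \lambda\, \zeta^f_M(i-1) + |\alpha_M(i)|^2\, \bar\gamma_M(i)$
$\zeta^b_M(i) = \lambda\, \zeta^b_M(i-1) + |\beta_M(i)|^2\, \gamma_M(i)$
$\bar\zeta^b_M(i) = \lambda\, \bar\zeta^b_M(i-1) + |\bar\beta_M(i)|^2\, \bar\gamma_M(i)$


$\xi_{M+1}(i) = \xi_M(i) - |\rho_M(i)|^2 / \zeta^b_M(i)$
$\zeta^b_{M+1}(i) = \bar\zeta^b_M(i) - |\delta_M(i)|^2 / \zeta^f_M(i)$
$\zeta^f_{M+1}(i) = \zeta^f_M(i) - |\delta_M(i)|^2 / \bar\zeta^b_M(i)$


$e_{M+1}(i) = e_M(i) - \kappa_M(i-1)\, \beta_M(i)$
$\beta_{M+1}(i) = \bar\beta_M(i) - \kappa^b_M(i-1)\, \alpha_M(i)$
$\alpha_{M+1}(i) = \alpha_M(i) - \kappa^f_M(i-1)\, \bar\beta_M(i)$


$\kappa_M(i) = \kappa_M(i-1) + \beta_M^*(i)\, \gamma_M(i)\, e_{M+1}(i) / \zeta^b_M(i)$
$\kappa^b_M(i) = \kappa^b_M(i-1) + \alpha_M^*(i)\, \bar\gamma_M(i)\, \beta_{M+1}(i) / \zeta^f_M(i)$
$\kappa^f_M(i) = \kappa^f_M(i-1) + \bar\beta_M^*(i)\, \bar\gamma_M(i)\, \alpha_{M+1}(i) / \bar\zeta^b_M(i)$


$\gamma_{M+1}(i) = \gamma_M(i) - |\gamma_M(i)\, \beta_M(i)|^2 / \zeta^b_M(i)$
$\gamma_{M+1}(i) = \bar\gamma_M(i) - |\bar\gamma_M(i)\, \alpha_M(i)|^2 / \zeta^f_M(i)$
$\bar\gamma_{M+1}(i) = \bar\gamma_M(i) - |\bar\gamma_M(i)\, \bar\beta_M(i)|^2 / \bar\zeta^b_M(i)$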


However, as was shown in Sec. 41.1 for the variables $\{b_M(i), \bar b_M(i)\}$, when the
regressors possess shift structure it will hold that $\bar\beta_M(i) = \beta_M(i-1)$. The resulting a
priori-based error-feedback lattice filter is listed in Alg. 42.1. The reason for the
qualification “error-feedback” is that, as seen from the listing of the algorithm, the errors
$\{e_{M+1}(i), \alpha_{M+1}(i), \beta_{M+1}(i-1)\}$ are fed into the recursions for the reflection
coefficients $\{\kappa_M(i), \kappa^f_M(i), \kappa^b_M(i)\}$.


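Algorithm 42.1 (A priori error-feedback filter) Let $0 \ll \lambda \le 1$ be a forgetting
factor and define $\Pi_M = \eta^{-1}\,\mathrm{diag}\{\lambda^{-2}, \lambda^{-3}, \ldots, \lambda^{-(M+1)}\}$. Consider a reference
sequence $\{d(j)\}$ and a regressor sequence $\{u_{M,j}\}$ with shift structure
and of dimension $1 \times M$, say, $u_{M,j} = [\, u(j) \;\; \ldots \;\; u(j-M+1) \,]$. For each
$i \ge 0$, the $M$-th order a priori estimation error, $e_M(i) = d(i) - u_{M,i} w_{M,i-1}$,
that results from the solution of the regularized least-squares problem:


$$\min_{w_M} \left[ \lambda^{i} w_M^* \Pi_M w_M + \sum_{j=0}^{i-1} \lambda^{i-1-j} |d(j) - u_{M,j} w_M|^2 \right]$$


can be computed as follows:


1. Initialization. From $m = 0$ to $m = M - 1$ set:

$\gamma_m(-1) = 1$, $\beta_m(-1) = 0$
$\kappa^f_m(-1) = \kappa^b_m(-1) = \kappa_m(-1) = 0$
$\zeta^f_m(-1) = \eta^{-1}\lambda^{-2}$, $\zeta^b_m(-1) = \eta^{-1}\lambda^{-m-2}$


2. For $i \ge 0$, repeat:


• Set $\gamma_0(i) = 1$, $\beta_0(i) = \alpha_0(i) = u(i)$, and $e_0(i) = d(i)$
• From $m = 0$ to $m = M - 1$, repeat:


$\zeta^f_m(i) = \lambda\, \zeta^f_m(i-1) + |\alpha_m(i)|^2\, \gamma_m(i-1)$
$\zeta^b_m(i) = \lambda\, \zeta^b_m(i-1) + |\beta_m(i)|^2\, \gamma_m(i)$
$\beta_{m+1}(i) = \beta_m(i-1) - \kappa^b_m(i-1)\, \alpha_m(i)$
$\alpha_{m+1}(i) = \alpha_m(i) - \kappa^f_m(i-1)\, \beta_m(i-1)$
$e_{m+1}(i) = e_m(i) - \kappa_m(i-1)\, \beta_m(i)$
$\kappa_m(i) = \kappa_m(i-1) + \beta_m^*(i)\, \gamma_m(i)\, e_{m+1}(i) / \zeta^b_m(i)$
$\kappa^f_m(i) = \kappa^f_m(i-1) + \beta_m^*(i-1)\, \gamma_m(i-1)\, \alpha_{m+1}(i) / \zeta^b_m(i-1)$
$\kappa^b_m(i) = \kappa^b_m(i-1) + \alpha_m^*(i)\, \gamma_m(i-1)\, \beta_{m+1}(i) / \zeta^f_m(i)$
$\gamma_{m+1}(i) = \gamma_m(i) - |\gamma_m(i)\, \beta_m(i)|^2 / \zeta^b_m(i)$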


42.2 A POSTERIORI ERROR-FEEDBACK LATTICE FILTER 


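An alternative feedback form that relies on a posteriori errors, as opposed to a priori
errors, can also be derived. This can be achieved by simply expressing the updates for the
reflection coefficients in terms of a posteriori quantities and rearranging terms. For
example, consider recursion (42.8) for the reflection coefficient $\kappa_M(i)$, which can be written
as

$$\kappa_M(i) = \left( 1 - \frac{|b'_M(i)|^2}{\zeta^b_M(i)} \right) \kappa_M(i-1) + \frac{b'^*_M(i)\, r'_M(i)}{\zeta^b_M(i)} \tag{42.13}$$

Now note from Table 40.1, and from the definition of $b'_M(i)$ in (42.3), that

$$1 - \frac{|b'_M(i)|^2}{\zeta^b_M(i)} = 1 - \frac{|b_M(i)|^2}{\gamma_M(i)\, \zeta^b_M(i)} = \frac{\gamma_{M+1}(i)}{\gamma_M(i)}$$

so that


$$\kappa_M(i) = \frac{\gamma_{M+1}(i)}{\gamma_M(i)}\, \kappa_M(i-1) + \frac{b'^*_M(i)\, r'_M(i)}{\zeta^b_M(i)} = \frac{\gamma_{M+1}(i)}{\gamma_M(i)} \left[ \kappa_M(i-1) + \frac{b'^*_M(i)\, r'_M(i)}{\zeta^b_M(i)} \cdot \frac{\gamma_M(i)}{\gamma_{M+1}(i)} \right]$$


Note further that


$$\frac{\gamma_{M+1}(i)}{\gamma_M(i)} = 1 - \frac{|b'_M(i)|^2}{\zeta^b_M(i)} = \frac{\zeta^b_M(i) - |b'_M(i)|^2}{\zeta^b_M(i)} = \frac{\lambda\, \zeta^b_M(i-1)}{\zeta^b_M(i)}$$


so that

$$\gamma_{M+1}(i)\, \zeta^b_M(i) = \lambda\, \gamma_M(i)\, \zeta^b_M(i-1)$$


Substituting this equality into the last expression above for $\kappa_M(i)$, and using $b'^*_M(i)\, r'_M(i) = b_M^*(i)\, r_M(i)/\gamma_M(i)$, we get


$$\kappa_M(i) = \frac{\gamma_{M+1}(i)}{\gamma_M(i)} \left[ \kappa_M(i-1) + \frac{b_M^*(i)\, r_M(i)}{\lambda\, \gamma_M(i)\, \zeta^b_M(i-1)} \right] \tag{42.14}$$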
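The two steps of this rearrangement can be spot-checked with a single set of numbers. The sketch below assumes real-valued quantities; every numerical value is an arbitrary placeholder. It verifies the identity $\gamma_{M+1}(i)\,\zeta^b_M(i) = \lambda\,\gamma_M(i)\,\zeta^b_M(i-1)$ and confirms that (42.13) and (42.14) return the same updated coefficient.

```python
import math

lam = 0.98

# Hypothetical quantities at time i (real-valued case)
gamma_M = 0.8        # gamma_M(i)
b, r = 0.6, -1.3     # a posteriori errors b_M(i), r_M(i)
zeta_prev = 2.5      # zeta^b_M(i-1)
kappa_prev = 0.35    # kappa_M(i-1)

# Time- and order-updates from Table 40.1
zeta = lam * zeta_prev + b * b / gamma_M     # zeta^b_M(i)
gamma_next = gamma_M - b * b / zeta          # gamma_{M+1}(i)

# Identity used in the derivation
identity_lhs = gamma_next * zeta
identity_rhs = lam * gamma_M * zeta_prev

# Form (42.13), written with angle-normalized errors
b_prime = b / math.sqrt(gamma_M)
r_prime = r / math.sqrt(gamma_M)
kappa_13 = (1.0 - b_prime ** 2 / zeta) * kappa_prev + b_prime * r_prime / zeta

# Form (42.14), written with a posteriori errors
kappa_14 = (gamma_next / gamma_M) * (
    kappa_prev + b * r / (lam * gamma_M * zeta_prev))

print(identity_lhs, identity_rhs)
print(kappa_13, kappa_14)
```

Both pairs agree up to rounding, which is consistent with (42.14) being an algebraic rewriting of (42.13) rather than a new recursion.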


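In a similar manner, we can derive time-updates for the other reflection coefficients:


$$\kappa^f_M(i) = \frac{\bar\gamma_{M+1}(i)}{\bar\gamma_M(i)} \left[ \kappa^f_M(i-1) + \frac{\bar b_M^*(i)\, f_M(i)}{\lambda\, \bar\gamma_M(i)\, \bar\zeta^b_M(i-1)} \right] \tag{42.15}$$


$$\kappa^b_M(i) = \frac{\gamma_{M+1}(i)}{\bar\gamma_M(i)} \left[ \kappa^b_M(i-1) + \frac{f_M^*(i)\, \bar b_M(i)}{\lambda\, \bar\gamma_M(i)\, \zeta^f_M(i-1)} \right] \tag{42.16}$$


As before, when the regression vectors have shift structure, it will hold that

$$\bar b_M(i) = b_M(i-1), \qquad \bar\gamma_M(i) = \gamma_M(i-1), \qquad \bar\zeta^b_M(i) = \zeta^b_M(i-1)$$


In this case, we arrive at the a posteriori-based error-feedback lattice filter of Alg. 42.2.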
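Algorithm 42.2 (A posteriori error-feedback filter) Let $0 \ll \lambda \le 1$ be a
forgetting factor and define $\Pi_M = \eta^{-1}\,\mathrm{diag}\{\lambda^{-2}, \lambda^{-3}, \ldots, \lambda^{-(M+1)}\}$. Consider
a reference sequence $\{d(j)\}$ and a regressor sequence $\{u_{M,j}\}$ with shift
structure and of dimension $1 \times M$, say, $u_{M,j} = [\, u(j) \;\; \ldots \;\; u(j-M+1) \,]$.
For each $i \ge 0$, the $M$-th order a posteriori estimation error, $r_M(i) =
d(i) - u_{M,i} w_{M,i}$, that results from the solution of the regularized least-squares
problem:


$$\min_{w_M} \left[ \lambda^{i+1} w_M^* \Pi_M w_M + \sum_{j=0}^{i} \lambda^{i-j} |d(j) - u_{M,j} w_M|^2 \right]$$


can be computed as follows:

1. Initialization. From $m = 0$ to $m = M - 1$ set:


$\gamma_m(-1) = 1$, $b_m(-1) = 0$, $\kappa^f_m(-1) = \kappa^b_m(-1) = \kappa_m(-1) = 0$
$\zeta^f_m(-1) = \eta^{-1}\lambda^{-2}$, $\zeta^b_m(-1) = \eta^{-1}\lambda^{-m-2}$


2. For $i \ge 0$, repeat:


• Set $\gamma_0(i) = 1$, $b_0(i) = f_0(i) = u(i)$, and $r_0(i) = d(i)$
• From $m = 0$ to $m = M - 1$, repeat:


$\zeta^f_m(i) = \lambda\, \zeta^f_m(i-1) + |f_m(i)|^2 / \gamma_m(i-1)$
$\zeta^b_m(i) = \lambda\, \zeta^b_m(i-1) + |b_m(i)|^2 / \gamma_m(i)$
$\gamma_{m+1}(i) = \gamma_m(i) - |b_m(i)|^2 / \zeta^b_m(i)$
$\kappa_m(i) = \dfrac{\gamma_{m+1}(i)}{\gamma_m(i)} \left[ \kappa_m(i-1) + \dfrac{b_m^*(i)\, r_m(i)}{\lambda\, \gamma_m(i)\, \zeta^b_m(i-1)} \right]$
$\kappa^f_m(i) = \dfrac{\gamma_{m+1}(i-1)}{\gamma_m(i-1)} \left[ \kappa^f_m(i-1) + \dfrac{b_m^*(i-1)\, f_m(i)}{\lambda\, \gamma_m(i-1)\, \zeta^b_m(i-2)} \right]$
$\kappa^b_m(i) = \dfrac{\gamma_{m+1}(i)}{\gamma_m(i-1)} \left[ \kappa^b_m(i-1) + \dfrac{f_m^*(i)\, b_m(i-1)}{\lambda\, \gamma_m(i-1)\, \zeta^f_m(i-1)} \right]$
$r_{m+1}(i) = r_m(i) - \kappa_m(i)\, b_m(i)$
$b_{m+1}(i) = b_m(i-1) - \kappa^b_m(i)\, f_m(i)$
$f_{m+1}(i) = f_m(i) - \kappa^f_m(i)\, b_m(i-1)$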


42.3 NORMALIZED LATTICE FILTER 


The lattice filters studied so far require the evaluation of three reflection coefficients,
$\{\kappa^f_M(i), \kappa^b_M(i), \kappa_M(i)\}$.


There is another equivalent lattice form that employs only two reflection coefficients. We
denote these new coefficients by $\{\kappa^a_M(i), \kappa^c_M(i)\}$. The algorithm can be derived as follows.


First, recall the definitions of $\{\kappa^f_M(i), \kappa^b_M(i), \kappa_M(i)\}$ from Table 40.1:

$$\begin{aligned}
\kappa_M(i) &= \rho_M^*(i) / \zeta^b_M(i) \\
\kappa^b_M(i) &= \delta_M(i) / \zeta^f_M(i) \\
\kappa^f_M(i) &= \delta_M^*(i) / \bar\zeta^b_M(i)
\end{aligned}$$


and introduce the modified cost

$$\zeta_M(i) \triangleq \eta^{-1}\lambda^{i-1} + \xi_M(i)$$

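Define also the normalized estimation errors


$$\begin{aligned}
\bar b''_M(i) &\triangleq \bar b_M(i) / \left( \bar\gamma_M^{1/2}(i)\, \bar\zeta_M^{b/2}(i) \right) \\
b''_M(i) &\triangleq b_M(i) / \left( \gamma_M^{1/2}(i)\, \zeta_M^{b/2}(i) \right) \\
f''_M(i) &\triangleq f_M(i) / \left( \bar\gamma_M^{1/2}(i)\, \zeta_M^{f/2}(i) \right) \\
r''_M(i) &\triangleq r_M(i) / \left( \gamma_M^{1/2}(i)\, \zeta_M^{1/2}(i) \right)
\end{aligned}$$


Comparing, for example, $b''_M(i)$ above with the angle-normalized variable $b'_M(i)$ in (42.3)
we see that we are now further normalizing by the square-root of $\zeta^b_M(i)$.
The normalized reflection coefficients that we are interested in are defined as follows:


$$\kappa^a_M(i) \triangleq \kappa^f_M(i)\, \frac{\bar\zeta_M^{b/2}(i)}{\zeta_M^{f/2}(i)} = \frac{\delta_M^*(i)}{\bar\zeta_M^{b/2}(i)\, \zeta_M^{f/2}(i)}, \qquad \kappa^c_M(i) \triangleq \kappa_M(i)\, \frac{\zeta_M^{b/2}(i)}{\zeta_M^{1/2}(i)} = \frac{\rho_M^*(i)}{\zeta_M^{b/2}(i)\, \zeta_M^{1/2}(i)}$$


That is, we scale the forward projection coefficient $\kappa^f_M(i)$ by $\bar\zeta_M^{b/2}(i)/\zeta_M^{f/2}(i)$. Likewise
for $\kappa^c_M(i)$. We now derive updates for these coefficients and show how they lead to a
normalized lattice form.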


Time-Update for the Normalized Reflection Coefficients 
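We start with the recursion for $\kappa^f_M(i)$ from Table 42.1:


$$\kappa^f_M(i) = \kappa^f_M(i-1) + \bar\beta_M^*(i)\, \bar\gamma_M(i)\, \alpha_{M+1}(i) / \bar\zeta^b_M(i)$$


which can be expressed in terms of a posteriori errors as follows:


$$\begin{aligned}
\kappa^f_M(i) &= \kappa^f_M(i-1) + \frac{\bar\beta_M^*(i)\, \bar\gamma_M(i)}{\bar\zeta^b_M(i)} \left[ \alpha_M(i) - \kappa^f_M(i-1)\, \bar\beta_M(i) \right] \\
&= \kappa^f_M(i-1) + \frac{\bar b_M^*(i)}{\bar\zeta^b_M(i)} \left[ \alpha_M(i) - \kappa^f_M(i-1)\, \bar\beta_M(i) \right] \\
&= \kappa^f_M(i-1) + \frac{\bar b_M^*(i)}{\bar\gamma_M(i)\, \bar\zeta^b_M(i)} \left[ f_M(i) - \kappa^f_M(i-1)\, \bar b_M(i) \right] \\
&= \left( 1 - \frac{|\bar b_M(i)|^2}{\bar\gamma_M(i)\, \bar\zeta^b_M(i)} \right) \kappa^f_M(i-1) + \frac{\bar b_M^*(i)\, f_M(i)}{\bar\gamma_M(i)\, \bar\zeta^b_M(i)} \\
&= \left( 1 - |\bar b''_M(i)|^2 \right) \kappa^f_M(i-1) + \frac{\bar b_M^*(i)\, f_M(i)}{\bar\gamma_M(i)\, \bar\zeta^b_M(i)}
\end{aligned}$$


Multiplying both sides of the above equality by the ratio $\bar\zeta_M^{b/2}(i)/\zeta_M^{f/2}(i)$, and using the
definition of the normalized coefficient $\kappa^a_M(i)$, we obtain


$$\kappa^a_M(i) = \frac{\bar\zeta_M^{b/2}(i)}{\zeta_M^{f/2}(i)}\, \kappa^f_M(i-1) \left( 1 - |\bar b''_M(i)|^2 \right) + f''_M(i)\, \bar b''^*_M(i) \tag{42.17}$$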
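However, the modified minimum costs $\zeta^f_M(i)$ and $\bar\zeta^b_M(i)$ satisfy the time-update relations
(cf. (40.72)):

$$\zeta^f_M(i) = \lambda\, \zeta^f_M(i-1) + |f_M(i)|^2 / \bar\gamma_M(i), \qquad \bar\zeta^b_M(i) = \lambda\, \bar\zeta^b_M(i-1) + |\bar b_M(i)|^2 / \bar\gamma_M(i)$$


These updates can be re-expressed in terms of the normalized estimation errors $\{f''_M(i), \bar b''_M(i)\}$.
Indeed, note that


$$\zeta^f_M(i) = \lambda\, \zeta^f_M(i-1) + |f''_M(i)|^2\, \zeta^f_M(i) \Longrightarrow \zeta^f_M(i) = \frac{\lambda\, \zeta^f_M(i-1)}{1 - |f''_M(i)|^2}$$

Likewise,

$$\bar\zeta^b_M(i) = \frac{\lambda\, \bar\zeta^b_M(i-1)}{1 - |\bar b''_M(i)|^2}$$


Now, by taking square-roots, we have


$$\zeta_M^{f/2}(i) = \frac{\sqrt{\lambda}\, \zeta_M^{f/2}(i-1)}{\sqrt{1 - |f''_M(i)|^2}}, \qquad \bar\zeta_M^{b/2}(i) = \frac{\sqrt{\lambda}\, \bar\zeta_M^{b/2}(i-1)}{\sqrt{1 - |\bar b''_M(i)|^2}}$$


and substituting these equations into (42.17), we arrive at a time-update recursion for the
first reflection coefficient:


$$\kappa^a_M(i) = \kappa^a_M(i-1) \sqrt{\left( 1 - |\bar b''_M(i)|^2 \right) \left( 1 - |f''_M(i)|^2 \right)} + f''_M(i)\, \bar b''^*_M(i) \tag{42.18}$$


This recursion is in terms of the normalized errors $\{\bar b''_M(i), f''_M(i)\}$. We thus need to
determine order-updates for these errors as well, which we pursue in the next section.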
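The normalized time-update can be checked against the definition of the normalized coefficient. The sketch below assumes real-valued data; the angle-normalized error sequences and the positive initial values of the two modified costs are arbitrary choices for the illustration. It propagates $\delta_M(i)$, $\zeta^f_M(i)$, and $\bar\zeta^b_M(i)$ by their time-updates, forms $\delta_M(i)/(\bar\zeta_M^{b/2}(i)\,\zeta_M^{f/2}(i))$, and compares the result with the value obtained from the square-root recursion (42.18).

```python
import math

lam = 0.98

# Hypothetical angle-normalized errors f'_M(i) and bbar'_M(i), i = 0, 1, ...
f_prime = [0.9, -0.2, 0.5, 1.1, -0.7]
bbar_prime = [0.3, 0.8, -0.4, 0.6, 0.2]

zeta_f = 0.5       # zeta^f_M(-1): any positive value (assumed here)
zeta_bbar = 0.3    # zetabar^b_M(-1): any positive value (assumed here)
delta = 0.0        # delta_M(-1) = 0
kappa_a = 0.0      # normalized coefficient at time -1

history = []
for f, b in zip(f_prime, bbar_prime):
    # Un-normalized time-updates written with angle-normalized errors
    delta = lam * delta + f * b
    zeta_f = lam * zeta_f + f * f
    zeta_bbar = lam * zeta_bbar + b * b
    # Normalized errors f''_M(i) and bbar''_M(i)
    f2 = f / math.sqrt(zeta_f)
    b2 = b / math.sqrt(zeta_bbar)
    # Square-root recursion (42.18) for the normalized coefficient
    kappa_a = kappa_a * math.sqrt((1.0 - b2 * b2) * (1.0 - f2 * f2)) + f2 * b2
    history.append((kappa_a, delta / math.sqrt(zeta_f * zeta_bbar)))

for recursive_value, definition_value in history:
    print(recursive_value, definition_value)
```

A convenient by-product of the normalization is that the recursion only ever handles quantities of magnitude less than one, since $1 - |f''_M(i)|^2 = \lambda\,\zeta^f_M(i-1)/\zeta^f_M(i)$ lies strictly between zero and one.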


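In a similar manner, we can derive a time-update recursion for the second normalized
reflection coefficient, $\kappa^c_M(i)$. We start from recursion (42.9) for $\kappa_M(i)$ (or from Table 42.1):


$$\begin{aligned}
\kappa_M(i) &= \kappa_M(i-1) + \frac{\beta_M^*(i)\, \gamma_M(i)}{\zeta^b_M(i)} \left[ e_M(i) - \beta_M(i)\, \kappa_M(i-1) \right] \\
&= \kappa_M(i-1) + \frac{b_M^*(i)}{\zeta^b_M(i)} \left[ e_M(i) - \beta_M(i)\, \kappa_M(i-1) \right] \\
&= \kappa_M(i-1) + \frac{b_M^*(i)}{\gamma_M(i)\, \zeta^b_M(i)} \left[ r_M(i) - b_M(i)\, \kappa_M(i-1) \right] \\
&= \left( 1 - \frac{|b_M(i)|^2}{\gamma_M(i)\, \zeta^b_M(i)} \right) \kappa_M(i-1) + \frac{b_M^*(i)\, r_M(i)}{\gamma_M(i)\, \zeta^b_M(i)} \\
&= \left( 1 - |b''_M(i)|^2 \right) \kappa_M(i-1) + \frac{b_M^*(i)\, r_M(i)}{\gamma_M(i)\, \zeta^b_M(i)}
\end{aligned}$$


Multiplying both sides of the above equality by the ratio $\zeta_M^{b/2}(i)/\zeta_M^{1/2}(i)$, and using the
definition of the normalized coefficient $\kappa^c_M(i)$, we obtain


$$\kappa^c_M(i) = \frac{\zeta_M^{b/2}(i)}{\zeta_M^{1/2}(i)}\, \kappa_M(i-1) \left( 1 - |b''_M(i)|^2 \right) + r''_M(i)\, b''^*_M(i) \tag{42.19}$$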


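However, the quantities $\zeta^b_M(i)$ and $\zeta_M(i)$ satisfy the time-update relations (cf. (40.72)):

$$
\zeta^b_M(i) = \lambda\,\zeta^b_M(i-1) + \frac{|b_M(i)|^2}{\gamma_M(i)}, \qquad \zeta_M(i) = \lambda\,\zeta_M(i-1) + \frac{|r_M(i)|^2}{\gamma_M(i)}
$$

These updates can be re-expressed in terms of the normalized estimation errors $\{b''_M(i), r''_M(i)\}$. Indeed, note that

$$
\zeta^b_M(i) = \lambda\,\zeta^b_M(i-1) + |b''_M(i)|^2\,\zeta^b_M(i) \;\Longrightarrow\; \zeta^b_M(i) = \frac{\lambda\,\zeta^b_M(i-1)}{1 - |b''_M(i)|^2}
$$

Likewise, $\zeta_M(i) = \lambda\,\zeta_M(i-1)/(1 - |r''_M(i)|^2)$. Taking square-roots, we have

$$
\zeta^{b/2}_M(i) = \frac{\sqrt{\lambda}\,\zeta^{b/2}_M(i-1)}{\sqrt{1 - |b''_M(i)|^2}}, \qquad \zeta^{1/2}_M(i) = \frac{\sqrt{\lambda}\,\zeta^{1/2}_M(i-1)}{\sqrt{1 - |r''_M(i)|^2}}
$$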


Substituting these equations into (42.19), we arrive at a time-update recursion for the second reflection coefficient:

$$
\omega^a_M(i) = \omega^a_M(i-1)\sqrt{\left(1 - |b''_M(i)|^2\right)\left(1 - |r''_M(i)|^2\right)} + b''^{*}_M(i)\,r''_M(i) \qquad (42.21)
$$

This recursion is again in terms of the normalized errors $\{b''_M(i), r''_M(i)\}$. We thus need to determine order-updates for these errors as well.


Order-Update for the Normalized Estimation Errors 
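We now derive order-updates for the normalized errors $\{f''_M(i), b''_M(i), \bar b''_M(i), r''_M(i)\}$.

We start with the order-update for $b_{M+1}(i)$ from Table 40.1, namely, $b_{M+1}(i) = \bar b_M(i) - \kappa^b_M(i) f_M(i)$. Dividing both sides by $\zeta^{b/2}_{M+1}(i)\,\gamma^{1/2}_{M+1}(i)$, we obtain

$$
b''_{M+1}(i) = \frac{\bar b_M(i) - \kappa^b_M(i)\,f_M(i)}{\zeta^{b/2}_{M+1}(i)\,\gamma^{1/2}_{M+1}(i)} \qquad (42.22)
$$

Now note that the quantities $\{\zeta^b_{M+1}(i), \gamma_{M+1}(i)\}$ satisfy the order-updates (cf. (40.72)):

$$
\zeta^b_{M+1}(i) = \bar\zeta^b_M(i) - \frac{|\delta_M(i)|^2}{\zeta^f_M(i)}, \qquad \gamma_{M+1}(i) = \bar\gamma_M(i) - \frac{|f_M(i)|^2}{\zeta^f_M(i)}
$$

Both of these updates can be rewritten in terms of the normalized reflection coefficient $\kappa^a_M(i)$:

$$
\zeta^b_{M+1}(i) = \bar\zeta^b_M(i)\left(1 - |\kappa^a_M(i)|^2\right), \qquad \gamma_{M+1}(i) = \bar\gamma_M(i)\left(1 - |f''_M(i)|^2\right)
$$

Substituting these relations into (42.22), we obtain

$$
b''_{M+1}(i) = \frac{\bar b''_M(i) - \kappa^{a*}_M(i)\,f''_M(i)}{\sqrt{\left(1 - |f''_M(i)|^2\right)\left(1 - |\kappa^a_M(i)|^2\right)}} \qquad (42.23)
$$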


Similarly, using the order-updates for $f_M(i)$, $\zeta^f_M(i)$, and $\bar\gamma_M(i)$ from Table 40.1 we obtain

$$
f''_{M+1}(i) = \frac{f''_M(i) - \kappa^a_M(i)\,\bar b''_M(i)}{\sqrt{\left(1 - |\bar b''_M(i)|^2\right)\left(1 - |\kappa^a_M(i)|^2\right)}} \qquad (42.24)
$$

Observe that the order-updates for $\{b''_M(i), f''_M(i)\}$ are in terms of a single reflection coefficient, namely $\kappa^a_M(i)$ and its conjugate. This is in contrast to the un-normalized lattice forms, where two separate reflection coefficients, $\{\kappa^b_M(i), \kappa^f_M(i)\}$, are needed for the order-updates of $\{b_M(i), f_M(i)\}$.
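
As an illustration of how a single normalized coefficient drives both order-updates, the sketch below implements (42.23) and (42.24) for real-valued data, where the conjugate plays no role. The function and argument names are hypothetical, and the inputs are assumed to have magnitude strictly less than one so that the square roots are well defined.

```python
import math


def normalized_order_update(f_norm, b_bar_norm, kappa_a):
    """Order-update of the normalized forward and backward errors.

    f_norm     : normalized forward error of order M
    b_bar_norm : normalized (barred) backward error of order M
    kappa_a    : normalized reflection coefficient of order M
    Returns (f_next, b_next), the normalized errors of order M+1.
    """
    p_f = math.sqrt(1.0 - f_norm ** 2)
    p_b = math.sqrt(1.0 - b_bar_norm ** 2)
    p_k = math.sqrt(1.0 - kappa_a ** 2)
    b_next = (b_bar_norm - kappa_a * f_norm) / (p_f * p_k)   # (42.23)
    f_next = (f_norm - kappa_a * b_bar_norm) / (p_b * p_k)   # (42.24)
    return f_next, b_next
```

Note that the same `kappa_a` appears in both lines; an un-normalized stage would need two distinct coefficients at this point.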

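Likewise, using the order-update for $\zeta_M(i)$ and $\gamma_M(i)$ (cf. Table 40.1 and (40.72)), we can establish the following recursion:

$$
r''_{M+1}(i) = \frac{r''_M(i) - \omega^a_M(i)\,b''_M(i)}{\sqrt{\left(1 - |b''_M(i)|^2\right)\left(1 - |\omega^a_M(i)|^2\right)}} \qquad (42.25)
$$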


Although the normalized lattice filter returns the normalized residual $r''_{M+1}(i)$, the estimation error $r_{M+1}(i)$ can be recovered as follows. From the definition of $r''_{M+1}(i)$ we have

$$
r_{M+1}(i) = \gamma^{1/2}_{M+1}(i)\,\zeta^{1/2}_{M+1}(i)\,r''_{M+1}(i)
$$

which indicates that we need to evaluate the scaling factor $\sigma_{M+1}(i) = \gamma^{1/2}_{M+1}(i)\,\zeta^{1/2}_{M+1}(i)$.

Now using the order-update for $\gamma_M(i)$ from Table 40.1, we have that

$$
\gamma_{M+1}(i) = \gamma_M(i)\left(1 - |b''_M(i)|^2\right) \qquad (42.26)
$$

Likewise, using the order-update (40.72), we obtain $\zeta_{M+1}(i) = \zeta_M(i)\left(1 - |\omega^a_M(i)|^2\right)$. Combining the above relations we arrive at the following order-update for the scaling factor:

$$
\sigma_{M+1}(i) = \sigma_M(i)\sqrt{\left(1 - |b''_M(i)|^2\right)\left(1 - |\omega^a_M(i)|^2\right)} \qquad (42.27)
$$
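
The scaling-factor recursion (42.27) is easy to exercise in isolation. The short Python sketch below assumes real-valued data; the function names are hypothetical and chosen only for this illustration.

```python
import math


def scale_update(sigma, b_norm, omega_a):
    """Order-update (42.27) of the scaling factor for real data."""
    return sigma * math.sqrt((1.0 - b_norm ** 2) * (1.0 - omega_a ** 2))


def recover_residual(sigma_next, r_norm_next):
    """Undo the normalization: residual = scaling factor * normalized residual."""
    return sigma_next * r_norm_next
```

In a lattice stage one would call `scale_update` once per order and then `recover_residual` whenever the un-normalized error is actually needed.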


We should again indicate that all the relations derived so far for the normalized lattice filter hold independent of data structure. When the regressors have shift structure, we can relate $\{\bar b''_M(i), b''_M(i)\}$ via $\bar b''_M(i) = b''_M(i-1)$. In this case, the recursions collapse to the following.


Algorithm 42.3 (Normalized lattice filter) Consider the same setting of
Alg. 42.2. For each $i \geq 0$, the $M$-th order a posteriori and a priori errors,
$\{r_M(i), e_M(i)\}$, can be computed as follows:


1. Initialization. Set 


o2(-1) = 971072, ((—1) = 571472, OM (71) = 0 form —0,...,M -1 


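2. For $i \geq 0$, repeat:

• Set

$$
\zeta^f_0(i) = \lambda\,\zeta^f_0(i-1) + |u(i)|^2, \qquad \zeta_0(i) = \lambda\,\zeta_0(i-1) + |d(i)|^2, \qquad \sigma_0(i) = \zeta^{1/2}_0(i)
$$

$$
b''_0(i) = f''_0(i) = u(i)/\zeta^{f/2}_0(i), \qquad r''_0(i) = d(i)/\zeta^{1/2}_0(i), \qquad \gamma_0(i) = 1
$$

• From $m = 0$ to $m = M - 1$, repeat:

$$
\begin{aligned}
p^b_m(i) &= \sqrt{1 - |b''_m(i)|^2}, \qquad p^f_m(i) = \sqrt{1 - |f''_m(i)|^2}, \qquad p^r_m(i) = \sqrt{1 - |r''_m(i)|^2} \\
\kappa^a_m(i) &= \kappa^a_m(i-1)\,p^b_m(i-1)\,p^f_m(i) + f''_m(i)\,b''^{*}_m(i-1) \\
\omega^a_m(i) &= \omega^a_m(i-1)\,p^b_m(i)\,p^r_m(i) + b''^{*}_m(i)\,r''_m(i) \\
p^\kappa_m(i) &= \sqrt{1 - |\kappa^a_m(i)|^2}, \qquad p^\omega_m(i) = \sqrt{1 - |\omega^a_m(i)|^2} \\
r''_{m+1}(i) &= \frac{1}{p^b_m(i)\,p^\omega_m(i)}\left[r''_m(i) - \omega^a_m(i)\,b''_m(i)\right] \\
b''_{m+1}(i) &= \frac{1}{p^f_m(i)\,p^\kappa_m(i)}\left[b''_m(i-1) - \kappa^{a*}_m(i)\,f''_m(i)\right] \\
f''_{m+1}(i) &= \frac{1}{p^b_m(i-1)\,p^\kappa_m(i)}\left[f''_m(i) - \kappa^a_m(i)\,b''_m(i-1)\right] \\
\sigma_{m+1}(i) &= \sigma_m(i)\,p^b_m(i)\,p^\omega_m(i), \qquad \gamma_{m+1}(i) = \gamma_m(i)\left(p^b_m(i)\right)^2 \\
r_{m+1}(i) &= \sigma_{m+1}(i)\,r''_{m+1}(i) \\
e_{m+1}(i) &= r_{m+1}(i)/\gamma_{m+1}(i)
\end{aligned}
$$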
The interpretations we provided for the coefficients $\{\kappa_M(i), \kappa^f_M(i), \kappa^b_M(i)\}$ in Sec. 42.1, in terms of solutions to first-order least-squares problems, can be used to motivate yet another lattice implementation in array form. We discussed array methods and their advantages in some detail in Chapter 33. We show here that such array methods can also be developed for order-recursive problems.

Thus, recall that in Sec. 42.1 we introduced the angle-normalized estimation errors

$$\{b'_M(i),\; r'_M(i),\; f'_M(i),\; \bar b'_M(i)\}$$

and the corresponding angle-normalized error vectors

$$\{b'_{M,i},\; r'_{M,i},\; f'_{M,i},\; \bar b'_{M,i}\}$$

We then argued that the reflection coefficients $\{\kappa_M(i), \kappa^f_M(i), \kappa^b_M(i)\}$ can be interpreted as the solutions to three simple (regularized) projection problems, namely

$\kappa_M(i)$ projects $r'_{M,i}$ onto $b'_{M,i}$

$\kappa^f_M(i)$ projects $f'_{M,i}$ onto $\bar b'_{M,i}$

$\kappa^b_M(i)$ projects $\bar b'_{M,i}$ onto $f'_{M,i}$

That is, each of these reflection coefficients solves the problem of projecting one angle-normalized error vector onto another. More specifically, they solve the following regularized least-squares problems:


i EEE 2 ; 
minx», [n Del? + Irie = kaPre lla, | => kpli) 


K 


min,¢ | est + | 


1 fu 
fMi x KMDM 


mins, Lar" + [b - ified, 


where 
529447? and H = nd? 


The above interpretations were used in Sec. 42.1 to show that the reflection coefficients $\{\kappa_M(i), \kappa^f_M(i), \kappa^b_M(i)\}$ can be time-updated by resorting to the RLS algorithm in each case.

Now in Chapter 33, we argued that least-squares solutions can also be updated in array form, e.g., by using the QR algorithm of Sec. 35.2. The QR method can therefore be used here to develop array methods for updating the reflection coefficients themselves.


43.1 ORDER-UPDATE OF OUTPUT ESTIMATION ERRORS 


We start with the reflection coefficient $\kappa_M(i)$. Comparing the cost function for $\kappa_M(i)$ in (43.1) with the one that appears in the statement of the QR method in Alg. 35.2, we see that we can make the following identifications:
w —— KM ü — 0 II] — 5 
d(i) — ry (i) ui — by (i) 


and 
Di — gH bg iM, = TATE) E d) 


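If we now write down the QR equations of Alg. 35.2 for these new variables, we arrive at the following statement. Define the normalized reflection coefficient

$$
q_M(i) \triangleq \zeta^{b/2}_M(i)\,\kappa_M(i) \qquad (43.2)
$$

Then start with $\zeta^{b/2}_M(-1) = \sqrt{\eta^{-1}\lambda^{-M-2}}$ and $q_M(-1) = 0$, and repeat for $i \geq 0$. At each iteration, find a $2 \times 2$ unitary matrix $\Theta_{M,i}$ that generates the zero entry in the post-array shown below, along with a leading positive entry in the first row and a positive entry $s$. The entries in the post-array would then correspond to:

$$
\begin{bmatrix} \lambda^{1/2}\zeta^{b/2}_M(i-1) & b'^{*}_M(i) \\ \lambda^{1/2}q^*_M(i-1) & r'^{*}_M(i) \\ 0 & 1 \end{bmatrix}\Theta_{M,i} = \begin{bmatrix} \zeta^{b/2}_M(i) & 0 \\ q^*_M(i) & z \\ b'_M(i)\,\zeta^{-b/2}_M(i) & s \end{bmatrix} \qquad (43.3)
$$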


where, as was the case with Alg. 35.2, the scalar quantities $\{s, z\}$ can be determined from the identities:

$$
s z^* = r'_M(i) - b'_M(i)\,\kappa_M(i), \qquad |s|^2 = 1 - \frac{|b'_M(i)|^2}{\zeta^b_M(i)} = \frac{\gamma_{M+1}(i)}{\gamma_M(i)}
$$

The first identity follows by equating the inner products of the second and third lines of the arrays, while the second identity follows from equating the norms of the last lines of the arrays. It is easy to see that the first expression leads to

$$
\begin{aligned}
s z^* &= r'_M(i) - b'_M(i)\,\kappa_M(i) \\
&= \frac{1}{\gamma^{1/2}_M(i)}\left[r_M(i) - b_M(i)\,\kappa_M(i)\right] \\
&= \frac{r_{M+1}(i)}{\gamma^{1/2}_M(i)}
\end{aligned}
$$

whereas the second expression leads to $s = \gamma^{1/2}_{M+1}(i)/\gamma^{1/2}_M(i)$ and, correspondingly, $z = r'^{*}_{M+1}(i)$. The array algorithm then becomes


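$$
\begin{bmatrix} \lambda^{1/2}\zeta^{b/2}_M(i-1) & b'^{*}_M(i) \\ \lambda^{1/2}q^*_M(i-1) & r'^{*}_M(i) \\ 0 & 1 \end{bmatrix}\Theta_{M,i} = \begin{bmatrix} \zeta^{b/2}_M(i) & 0 \\ q^*_M(i) & r'^{*}_{M+1}(i) \\ b'_M(i)\,\zeta^{-b/2}_M(i) & \gamma^{1/2}_{M+1}(i)/\gamma^{1/2}_M(i) \end{bmatrix} \qquad (43.4)
$$

If we further multiply the last rows on both sides of (43.4) by $\gamma^{1/2}_M(i)$ we arrive at the array equation:

$$
\begin{bmatrix} \lambda^{1/2}\zeta^{b/2}_M(i-1) & b'^{*}_M(i) \\ \lambda^{1/2}q^*_M(i-1) & r'^{*}_M(i) \\ 0 & \gamma^{1/2}_M(i) \end{bmatrix}\Theta_{M,i} = \begin{bmatrix} \zeta^{b/2}_M(i) & 0 \\ q^*_M(i) & r'^{*}_{M+1}(i) \\ b_M(i)\,\zeta^{-b/2}_M(i) & \gamma^{1/2}_{M+1}(i) \end{bmatrix} \qquad (43.5)
$$

This step tells us how to order-update the angle-normalized variable $r'_M(i)$. If desired, the reflection coefficient $\kappa_M(i)$ can be determined from the equality

$$
\kappa_M(i) = q_M(i)/\zeta^{b/2}_M(i) \qquad (43.6)
$$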
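
To make the array step concrete, the following Python sketch carries out one rotation of this kind for real-valued data. It is a minimal illustration under stated assumptions, not the book's implementation: the rotation is written out explicitly as a Givens rotation with parameters `c` and `s`, and all names are hypothetical.

```python
import math


def qr_lattice_step(lam, zeta_half_prev, q_prev, b_ang, r_ang, gamma_half):
    """One array step for the output-error stage (real-valued data).

    Pre-array rows:  [sqrt(lam)*zeta_half_prev, b_ang]
                     [sqrt(lam)*q_prev,         r_ang]
                     [0,                        gamma_half]
    The rotation Theta = [[c, -s], [s, c]] zeroes the (1,2) entry.
    """
    a = math.sqrt(lam) * zeta_half_prev
    zeta_half = math.hypot(a, b_ang)      # positive leading entry of post-array
    c, s = a / zeta_half, b_ang / zeta_half
    # A row [x, y] times Theta equals [c*x + s*y, -s*x + c*y].
    q_new = c * math.sqrt(lam) * q_prev + s * r_ang
    r_ang_next = -s * math.sqrt(lam) * q_prev + c * r_ang
    b_scaled = s * gamma_half             # un-normalized b over zeta_half
    gamma_half_next = c * gamma_half      # positive (3,2) entry
    return zeta_half, q_new, r_ang_next, b_scaled, gamma_half_next
```

The reflection coefficient is then available as `q_new / zeta_half`, and the rotated second row delivers the angle-normalized error of the next order without any explicit division by the cost.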


We now derive array methods for order-updating the angle-normalized variables $\{f'_M(i), b'_M(i)\}$ by applying similar arguments to the other cost functions in (43.1).


43.2 ORDER-UPDATE OF BACKWARD ESTIMATION ERRORS 


Consider the reflection coefficient $\kappa^b_M(i)$. Comparing its cost function from (43.1) with the one that appears in the statement of the QR method in Alg. 35.2, we see that we can make the following identifications:


w— Koy w-— 0 eq 
d(i) — b'g (i) ui — fyli) 
and 
9; —— HN + foi fu = ATTA x, m = di Y 


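If we now write down the QR equations of Alg. 35.2 for these new variables, we arrive at the following statement. Define the normalized reflection coefficient

$$
q^b_M(i) \triangleq \zeta^{f/2}_M(i)\,\kappa^b_M(i) \qquad (43.7)
$$

Then start with $\zeta^{f/2}_M(-1) = \sqrt{\eta^{-1}\lambda^{-2}}$ and $q^b_M(-1) = 0$, and repeat for $i \geq 0$. At each iteration, find a $2 \times 2$ unitary matrix $\Theta^b_{M,i}$ that generates the zero entry in the post-array below, along with a positive leading entry in the first row and a positive $s$. The entries in the post-array would then correspond to:

$$
\begin{bmatrix} \lambda^{1/2}\zeta^{f/2}_M(i-1) & f'^{*}_M(i) \\ \lambda^{1/2}q^{b*}_M(i-1) & \bar b'^{*}_M(i) \\ 0 & 1 \end{bmatrix}\Theta^b_{M,i} = \begin{bmatrix} \zeta^{f/2}_M(i) & 0 \\ q^{b*}_M(i) & z \\ f'_M(i)\,\zeta^{-f/2}_M(i) & s \end{bmatrix} \qquad (43.8)
$$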


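where, as was the case with Alg. 35.2, the scalar quantities $\{s, z\}$ can be determined from the identities:

$$
s z^* = \bar b'_M(i) - f'_M(i)\,\kappa^b_M(i), \qquad |s|^2 = 1 - \frac{|f'_M(i)|^2}{\zeta^f_M(i)} = \frac{\gamma_{M+1}(i)}{\bar\gamma_M(i)}
$$

The first identity follows by equating the inner products of the second and third lines of the arrays, while the second identity follows from equating the norms of the last lines of the arrays. It is easy to see that the first expression leads to

$$
\begin{aligned}
s z^* &= \bar b'_M(i) - f'_M(i)\,\kappa^b_M(i) \\
&= \frac{1}{\bar\gamma^{1/2}_M(i)}\left[\bar b_M(i) - f_M(i)\,\kappa^b_M(i)\right] \\
&= \frac{b_{M+1}(i)}{\bar\gamma^{1/2}_M(i)} \\
&= \frac{\gamma^{1/2}_{M+1}(i)}{\bar\gamma^{1/2}_M(i)}\,b'_{M+1}(i)
\end{aligned}
$$

whereas the second expression gives $s = \gamma^{1/2}_{M+1}(i)/\bar\gamma^{1/2}_M(i)$ and, consequently, $z = b'^{*}_{M+1}(i)$. In this way, the array algorithm (43.8) becomes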


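$$
\begin{bmatrix} \lambda^{1/2}\zeta^{f/2}_M(i-1) & f'^{*}_M(i) \\ \lambda^{1/2}q^{b*}_M(i-1) & \bar b'^{*}_M(i) \\ 0 & 1 \end{bmatrix}\Theta^b_{M,i} = \begin{bmatrix} \zeta^{f/2}_M(i) & 0 \\ q^{b*}_M(i) & b'^{*}_{M+1}(i) \\ f'_M(i)\,\zeta^{-f/2}_M(i) & \gamma^{1/2}_{M+1}(i)/\bar\gamma^{1/2}_M(i) \end{bmatrix} \qquad (43.9)
$$

If we further multiply the last rows on both sides of (43.9) by $\bar\gamma^{1/2}_M(i)$ we arrive at the array equation:

$$
\begin{bmatrix} \lambda^{1/2}\zeta^{f/2}_M(i-1) & f'^{*}_M(i) \\ \lambda^{1/2}q^{b*}_M(i-1) & \bar b'^{*}_M(i) \\ 0 & \bar\gamma^{1/2}_M(i) \end{bmatrix}\Theta^b_{M,i} = \begin{bmatrix} \zeta^{f/2}_M(i) & 0 \\ q^{b*}_M(i) & b'^{*}_{M+1}(i) \\ f_M(i)\,\zeta^{-f/2}_M(i) & \gamma^{1/2}_{M+1}(i) \end{bmatrix} \qquad (43.10)
$$

This step tells us how to order-update the angle-normalized variable $b'_M(i)$. If desired, the reflection coefficient $\kappa^b_M(i)$ can be determined from the equality

$$
\kappa^b_M(i) = q^b_M(i)/\zeta^{f/2}_M(i) \qquad (43.11)
$$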


43.3 ORDER-UPDATE OF FORWARD ESTIMATION ERRORS 


Finally, consider the reflection coefficient $\kappa^f_M(i)$. Comparing its cost function from (43.1) with the one that appears in the statement of the QR method in Alg. 35.2, we see that we can make the following identifications:


w — Ki, ü — 0 Il<+— ñ 
d(i) — fuli) u;i — by (i) 


and " E 
Di — FON +b AD, = q 1X 6,6) 5 GO 
If we now write down the QR equations of Alg. 35.2 for these new variables, we arrive at the following statement. Define the normalized reflection coefficient

$$
q^f_M(i) \triangleq \bar\zeta^{b/2}_M(i)\,\kappa^f_M(i) \qquad (43.12)
$$


Then start with (AC) = y 0-1A-M-? and al, (—1) = 0, and repeat for i > 0. At each 
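iteration, find a $2 \times 2$ unitary matrix $\Theta^f_{M,i}$ that generates the zero entry in the post-array below, along with a leading positive entry in the first row and a positive $s$. The entries in the post-array would then correspond to:

$$
\begin{bmatrix} \lambda^{1/2}\bar\zeta^{b/2}_M(i-1) & \bar b'^{*}_M(i) \\ \lambda^{1/2}q^{f*}_M(i-1) & f'^{*}_M(i) \\ 0 & 1 \end{bmatrix}\Theta^f_{M,i} = \begin{bmatrix} \bar\zeta^{b/2}_M(i) & 0 \\ q^{f*}_M(i) & z \\ \bar b'_M(i)\,\bar\zeta^{-b/2}_M(i) & s \end{bmatrix} \qquad (43.13)
$$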

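where, as was the case with Alg. 35.2, the scalar quantities $\{s, z\}$ can be determined from the identities:

$$
s z^* = f'_M(i) - \bar b'_M(i)\,\kappa^f_M(i), \qquad |s|^2 = 1 - \frac{|\bar b'_M(i)|^2}{\bar\zeta^b_M(i)} = \frac{\bar\gamma_{M+1}(i)}{\bar\gamma_M(i)}
$$

The first identity follows by equating the inner products of the second and third lines of the arrays, while the second identity follows from equating the norms of the last lines of the arrays. It is easy to see that the first expression leads to

$$
\begin{aligned}
s z^* &= f'_M(i) - \bar b'_M(i)\,\kappa^f_M(i) \\
&= \frac{1}{\bar\gamma^{1/2}_M(i)}\left[f_M(i) - \bar b_M(i)\,\kappa^f_M(i)\right] \\
&= \frac{f_{M+1}(i)}{\bar\gamma^{1/2}_M(i)} \\
&= \frac{\bar\gamma^{1/2}_{M+1}(i)}{\bar\gamma^{1/2}_M(i)}\,f'_{M+1}(i)
\end{aligned}
$$

whereas the second expression gives $s = \bar\gamma^{1/2}_{M+1}(i)/\bar\gamma^{1/2}_M(i)$ and, consequently, $z = f'^{*}_{M+1}(i)$. In this way, the array algorithm (43.13) becomes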


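$$
\begin{bmatrix} \lambda^{1/2}\bar\zeta^{b/2}_M(i-1) & \bar b'^{*}_M(i) \\ \lambda^{1/2}q^{f*}_M(i-1) & f'^{*}_M(i) \\ 0 & 1 \end{bmatrix}\Theta^f_{M,i} = \begin{bmatrix} \bar\zeta^{b/2}_M(i) & 0 \\ q^{f*}_M(i) & f'^{*}_{M+1}(i) \\ \bar b'_M(i)\,\bar\zeta^{-b/2}_M(i) & \bar\gamma^{1/2}_{M+1}(i)/\bar\gamma^{1/2}_M(i) \end{bmatrix} \qquad (43.14)
$$

If we further multiply the last rows on both sides of (43.14) by $\bar\gamma^{1/2}_M(i)$ we arrive at the array equation:

$$
\begin{bmatrix} \lambda^{1/2}\bar\zeta^{b/2}_M(i-1) & \bar b'^{*}_M(i) \\ \lambda^{1/2}q^{f*}_M(i-1) & f'^{*}_M(i) \\ 0 & \bar\gamma^{1/2}_M(i) \end{bmatrix}\Theta^f_{M,i} = \begin{bmatrix} \bar\zeta^{b/2}_M(i) & 0 \\ q^{f*}_M(i) & f'^{*}_{M+1}(i) \\ \bar b_M(i)\,\bar\zeta^{-b/2}_M(i) & \bar\gamma^{1/2}_{M+1}(i) \end{bmatrix} \qquad (43.15)
$$

This step tells us how to order-update the angle-normalized variable $f'_M(i)$. If desired, the reflection coefficient $\kappa^f_M(i)$ can be determined from the equality

$$
\kappa^f_M(i) = q^f_M(i)/\bar\zeta^{b/2}_M(i) \qquad (43.16)
$$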


43.4 SIGNIFICANCE OF DATA STRUCTURE 


As we already know, when the successive regressors have shift structure it holds that

$$
\bar b'_M(i) = b'_M(i-1), \qquad \bar b_M(i) = b_M(i-1), \qquad \bar\gamma_M(i) = \gamma_M(i-1), \qquad \bar\zeta^b_M(i) = \zeta^b_M(i-1)
$$

and we are led to the array-based lattice algorithm, also known as the QRD-based lattice filter (see Fig. 43.1). The qualification "QRD-based" is used to indicate that the array recursions correspond to QR decompositions of the corresponding pre-arrays (recall the third remark following the statement of Alg. 35.1).

For comparison purposes, Table 43.1 lists the estimated computational cost per itera- 
tion for the various lattice filters derived in this chapter assuming real data. The costs are 
in terms of the number of multiplications, additions, divisions, and square-roots that are 
needed for each iteration. It is seen that lattice filters generally require O(20M) operations 
per iteration. 


FIGURE 43.1 The QRD-based lattice filter.

Algorithm 43.1 (Array lattice filter) Consider again the same setting of
Alg. 42.2. For each $i \geq 0$, the $M$-th order a posteriori estimation error,
$r_M(i) = d(i) - u_{M,i}w_{M,i}$, that results from the solution of the regularized
least-squares problem

$$
\min_{w_M}\left[\lambda^{i+1}\,w^*_M\Pi_M w_M + \sum_{j=0}^{i}\lambda^{i-j}\,|d(j) - u_{M,j}w_M|^2\right]
$$

can be computed as follows:


1. Initialization. From m — 0 to m = M — 1 set: 


dPCn- vmm, dono Yers 


$$q_m(-1) = 0, \qquad q^f_m(-1) = 0, \qquad q^b_m(-1) = 0, \qquad b'_m(-1) = 0$$


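2. For $i \geq 0$, repeat:

• Set $\gamma^{1/2}_0(i) = 1$, $b'_0(i) = f'_0(i) = u(i)$, and $r'_0(i) = d(i)$.
• For $m = 0$ to $m = M - 1$, apply $2 \times 2$ unitary rotations $\Theta^f_{m,i}$, $\Theta_{m,i}$, and $\Theta^b_{m,i}$, with positive (2,2) entries, in order to annihilate the (1,2) entries of the post-arrays below:

$$
\begin{bmatrix} \lambda^{1/2}\zeta^{b/2}_m(i-2) & b'^{*}_m(i-1) \\ \lambda^{1/2}q^{f*}_m(i-1) & f'^{*}_m(i) \end{bmatrix}\Theta^f_{m,i} = \begin{bmatrix} \zeta^{b/2}_m(i-1) & 0 \\ q^{f*}_m(i) & f'^{*}_{m+1}(i) \end{bmatrix}
$$

$$
\begin{bmatrix} \lambda^{1/2}\zeta^{b/2}_m(i-1) & b'^{*}_m(i) \\ \lambda^{1/2}q^{*}_m(i-1) & r'^{*}_m(i) \\ 0 & \gamma^{1/2}_m(i) \end{bmatrix}\Theta_{m,i} = \begin{bmatrix} \zeta^{b/2}_m(i) & 0 \\ q^{*}_m(i) & r'^{*}_{m+1}(i) \\ b'_m(i)\,\gamma^{1/2}_m(i)\,\zeta^{-b/2}_m(i) & \gamma^{1/2}_{m+1}(i) \end{bmatrix}
$$

$$
\begin{bmatrix} \lambda^{1/2}\zeta^{f/2}_m(i-1) & f'^{*}_m(i) \\ \lambda^{1/2}q^{b*}_m(i-1) & b'^{*}_m(i-1) \end{bmatrix}\Theta^b_{m,i} = \begin{bmatrix} \zeta^{f/2}_m(i) & 0 \\ q^{b*}_m(i) & b'^{*}_{m+1}(i) \end{bmatrix}
$$

and set $r_{m+1}(i) = r'_{m+1}(i)\,\gamma^{1/2}_{m+1}(i)$.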


TABLE 43.1 Estimated computational cost per iteration for various lattice filters.

Algorithm | × | + | ÷ | √ | Reference
A posteriori lattice | 18M | … | … | - | Algorithm 41.1
A priori error-feedback lattice | … | 8M | 2M | - | Algorithm 42.1
A posteriori error-feedback lattice | … | … | … | - | Algorithm 42.2
Normalized lattice | 18M | 5M | 3M | 5M | Algorithm 42.3
Array lattice | … | … | … | 2M | Algorithm 43.1


Summary and Notes 


The chapters in this part describe several order-recursive (lattice) implementations of RLS.


SUMMARY OF MAIN RESULTS


1. The lattice forms are primarily concerned with order-updating the output estimation error and not the weight vector itself. To do so, forward and backward prediction errors also need to be order-updated.

2. All order-update relations derived in the chapter hold irrespective of data structure. As indicated in Fig. 41.1, the only place where the structure of the regressors is relevant is in knowing how to generate the error quantities $\bar b_m(i)$ from the error quantities $b_m(i)$: both these errors are related to backward projection problems.

3. When the regressors possess shift structure, it holds that $\bar b_m(i) = b_m(i-1)$. There are situations where the regressors are not shifted versions of each other and yet one can still relate $\{\bar b_m(i), b_m(i)\}$; see, e.g., Chapter 16 of Sayed (2003) on Laguerre lattice filters and Merched and Sayed (2000b,2001a).

4. Seven lattice forms are described in the text: (a) a posteriori lattice form, (b) a priori lattice form, (c) a priori lattice form with error feedback, (d) a posteriori lattice form with error feedback, (e) normalized lattice form, (f) array lattice form, and (g) Givens-based lattice form (described in Prob. X.7).

5. The lattice forms with error feedback are based on time-updating the three reflection coefficients $\{\kappa_m(i), \kappa^f_m(i), \kappa^b_m(i)\}$. The standard a posteriori and a priori lattice forms are based on evaluating the reflection coefficients as ratios. The normalized lattice form, on the other hand, uses only two reflection coefficients. The array form uses three Givens rotations and updates angle-normalized errors. The Givens-based lattice is simply the array form with the equations spelled out.

6. The array lattice form seems to be the most reliable in finite precision along with the a priori form with error feedback. The a posteriori lattice form with error feedback seems to be the least reliable. A computer project at the end of this part illustrates these behaviors.

7. The derivations of lattice filters in this book account for regularization. Moreover, the derivation of the time-updates for the reflection coefficients exploits the useful observation that these coefficients can be interpreted as solutions to least-squares problems in their own right and, therefore, that RLS updates can be used to time-update them.


BIBLIOGRAPHIC NOTES 


Regularization. In comparison to conventional derivations of lattice filters in the literature, we
have incorporated regularization into our derivations.

Pre-windowed data. In this book we assume pre-windowed data, which is the case of most interest
in practice. In pre-windowing, the input data {u(i), d(i)} are assumed to be zero prior to filter
operation, i.e., u(i) = d(i) = 0 for i < 0. Other forms of data windowing are possible (see, e.g.,
Honig and Messerschmitt (1984) and Alexander (1986)).
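
As a small illustration of pre-windowing, the Python sketch below builds an order-M regressor from an input sequence by treating all samples before time 0 as zero. The helper name and the ordering of the regressor entries, [u(i), u(i-1), ..., u(i-M+1)], are assumptions made for this example only.

```python
def prewindowed_regressor(u, i, M):
    """Return [u(i), u(i-1), ..., u(i-M+1)], with u(j) = 0 for j < 0.

    u : list of input samples u(0), u(1), ...
    i : current time index (0 <= i < len(u))
    M : regressor length (filter order)
    """
    # The explicit index check avoids Python's wrap-around on negative indices.
    return [u[i - k] if i - k >= 0 else 0.0 for k in range(M)]
```

With this construction, successive regressors are shifted versions of one another, which is the shift structure that the lattice recursions exploit.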


Lattice forms. The a posteriori-based lattice filter was developed by Morf and Lee (1978) and 
Lee, Morf, and Friedlander (1981). Further analysis and geometric derivations appear in Friedlander 
(1982) and Lev-Ari, Kailath, and Cioffi (1984). The idea of updating the reflection coefficients 
directly in order to improve the numerical properties of finite-word-length implementations, as in 
the a priori and a posteriori error-feedback lattice forms, is due to Ling, Manolakis, and Proakis 
(1985,1986). 


Numerical issues. In Levin and Cowan (1994), the performance of several recursive least-squares
algorithms is compared in finite-precision environments. The results suggest that the array lattice
form has superior numerical properties, a fact that agrees with the simulation results in the computer
project at the end of this part.


Array derivation. The idea of using pre- and post-arrays to derive the array-based lattice filter of 
Chapter 43, also known as the QRD-lattice filter, is from Sayed and Kailath (1994b). 


QR-based lattice. The idea of using the QR method as the backbone for deriving fast recursive- 
least-squares filters was put forward independently in the late 1980s by a number of authors includ- 
ing Cioffi (1988,1990), Bellanger (1988), Proudler, McWhirter, and Shepherd (1988,1989,1990), 
and Ling (1989,1991). The papers by Proudler, McWhirter, and Shepherd (1989,1990) and Ling 
(1989,1991) specifically derive lattice structures; the first work relies on the QR-decomposition 
while the second work uses the modified Gram-Schmidt orthogonalization procedure; both lattice 
algorithms are essentially the Givens-based lattice form of Prob. X.7; see also Proakis et al. (1992). 
The relation between fast QR methods and lattice filters was discussed by Regalia and Bellanger 
(1991). Related work on fast QR least-squares algorithms can also be found in Rontogiannis and 
Theodoridis (1998). 


Prediction theory. There has been extensive work on lattice structures for least-mean-squares (as 
opposed to least-squares) estimation and prediction since the late 1940s. These earlier investigations 
have a rich history and they relate to contributions by Szegő (1939) on orthogonal polynomials and by
Levinson (1947) and Durbin (1960) on the linear prediction of stationary time series. The reflection 
coefficients of the resulting lattice prediction filters are usually referred to as PARCOR (i.e., as partial 
correlation) coefficients, since they relate to the correlation between forward and backward prediction 
errors — recall Probs. II.31 and II.32. It is instructive to compare the recursions of Prob. II.31 for 
the least-mean-squares lattice filter with the recursions of the a posteriori lattice filter of Alg. 41.1; 
compare also with the normalized lattice filter of Alg. 42.3. One of the earliest applications of these 
lattice least-mean-squares (prediction) filters was in the context of speech analysis and synthesis by 
Itakura and Saito (1972). 


State-space derivation. In Sec. 42.1 we showed that each one of the reflection coefficients,
$\{\kappa_M(i), \kappa^f_M(i), \kappa^b_M(i)\}$, can be interpreted as performing the projection of a normalized error vector
onto another normalized error vector. In other words, each one of these coefficients can be
regarded as the solution to a first-order regularized least-squares problem. In this way, by invoking 
the equivalence that exists between RLS and Kalman filtering (cf. Sec. 31.2), we should be able to 
derive lattice filters by equivalently formulating state-space estimation problems of first-order for 
updating each of these reflection coefficients. This is indeed the case, as was shown by Sayed and 
Kailath (1994b). This point of view provides another convenient way of deriving lattice filters — see 
Probs. X.11-X.13. 


Batch versus sequential processing. Assume the total number of data (i.e., samples) that is available for processing by a lattice filter is $N + 1$, say $\{d(j), u(j),\ 0 \leq j \leq N\}$. At each time instant $i$, the estimate $\hat d_M(i)$ computed by the lattice implementation is an estimate of $d(i)$ of order $M$ and it is based on regression data up to and including time $i$. Specifically, $\hat d_M(i) = u_{M,i}w_{M,i}$, where $u_{M,i}$ is the $M$-th order regressor at time $i$ and $w_{M,i}$ is the weight-vector solution at time $i$ as well. All lattice filters described in this part, as well as all other least-squares adaptive filters described in the previous chapters, operate in this sequential manner. Namely, the signal estimates that are available at a particular time instant are based on the regression data up to that same time instant. This fact is in contrast to a batch least-squares solution (cf. Chapter 29), whereby all $N + 1$ regression vectors and reference data are used to generate the signal estimates. For instance, in a batch implementation, the estimate of $d(i)$ would be computed as $\hat d(i) = u_{M,i}w_{M,N}$, where $w_{M,N}$ is the weight estimate that is based on all data (i.e., up to time $N$). There have been several investigations in the literature pertaining to the issue of how to approximate a batch solution from a collection of sequential solutions. Such investigations have been conducted most notably in the coding and prediction theory literature, e.g., by Rissanen (1984,1986), Ryabko (1984,1988), Wax (1988), and others. In later works by Merhav and Feder (1993,1998), Singer and Feder (1999), and Singer, Kozat, and Feder (2002), the authors examined the relation between batch and sequential processing in the context of adaptive filtering. For example, Singer and Feder (1999) showed how to select some weighting coefficients $\{\eta_m(i)\}$ and constructed a so-called universal predictor for $d(i)$, denoted by $\hat d_u(i)$, by combining the sequential order-recursive estimates $\{\hat d_m(i)\}$ that are generated by a lattice implementation as $\hat d_u(i) = \sum_m \eta_m(i)\,\hat d_m(i)$. The performance of this universal solution was then shown to approximate reasonably well that of the batch solution; specifically, the performance of both batch and universal solutions as measured in terms of the energies of the resulting residual vectors (over all data) were shown to be within $O(N^{-1}\ln N)$ of each other.

PROBLEMS


Problem X.1 (Forward and backward residuals) Assume the regressors have shift structure
and use the following updates for the conversion factor from Table 40.1,

$$
\gamma_{M+1}(i) = \gamma_M(i-1) - \frac{|f_M(i)|^2}{\zeta^f_M(i)}, \qquad \gamma_{M+1}(i) = \gamma_M(i) - \frac{|b_M(i)|^2}{\zeta^b_M(i)}
$$


to establish the identity 


Bara COP 
GON 


Rivest eae 


dao 


by i= MO y 
eM G7 eO 


Problem X.2 (Conversion factor) Show that "bat i)- [1 - Ye (Ib; CP? (à) nio 


Problem X.3 (Prediction vectors) Assume the regressors {um,;} have shift structure. Use the 
discussion in Secs. 40.2—40.4, and the results of Lemmas 32.1 and 32.2, to argue that the weight 
prediction vectors (wi, wy i} satisfy the relations 


b 0 bep 1 
VMaii = + &w(i) 
Exc | Ub ii | ~ whe 


f b 
Wa ; —UM.i— 
| d: | + wt ol m 1 | 


Problem X.4 (Prediction errors) Introduce the shift operator g~'u(i) = u(i — 1). Applying an 
operator A(g ) = S77, a«(i)g ^ to a sequence u(i) results in A(q^! u(i) = Yu o ax(i)u(i — 
k). Show that the forward and backward prediction errors of order M + 1 can be generated as 


follows: m 
bua) | _ «o0 GO |) fu) 
fai) (11 | -q 1 J | u(i) | 


Problem X.5 (Reflection coefficients) Starting from recursion (42.8), and using the time-update 
(42.4) for Q5, (i), show that, when the regressors have shift structure, 


f 
WM+1,i 


BERS PED Bi) 
Likewise, show that 
42 AXdG-2).,,. belt) ss. 
sh = iic) CD + ca 
f P 
du) = MS Dag + By 1 


e c) 
Problem X.6 (Givens rotations) Refer to the statement of the QRD-based lattice filter in Alg. 43.1.


The unitary matrix O;,,; is a Givens rotation of the form 


s 
Sura e | 1 =| where jio mU). 2 


"oir [ 1 Ach. (i — 1) 


Verify that the entries of Om, and e can be identified as 


CN | NRU- DRO OVE | a [ hO -hA 
| b (i)/ JC) Jr. G = D/C sk) d) 
V Ad - 1)/6.) -fmG/VGRG) | a | ei) mud 
FZO AA Adi = 1/6.) sm) chli) 


b 
Oni 


Problem X.7 (Givens-based lattice) Use the identifications of Prob. X.6 to deduce from the 
QRD-based lattice filter of Alg. 43.1 the following equivalent form, which propagates angle-normalized 
errors: 


1. Initialization. From m = 0 to m = M — 1 set: 
2/7 (-1) = V/g- 1472, GP (-1) = VTAT, am(-1) = a] (71) = a (71) = 0 
2. For $i \geq 0$, repeat:


• Set $\gamma^{1/2}_0(i) = 1$, $b'_0(i) = f'_0(i) = u(i)$, and $r'_0(i) = d(i)$
• For $m = 0$ to $m = M - 1$ do:


2 1/2 

dew = h(a- «wor 
1/2 

ve = boy eser] 
ew = AP i- 0/00 () 
s) = bhli) (i) 
d = DA UR D/O 
h(i) = hOLA 
fantd = ob (i 1)f4() — A? sz — gh — 1) 
dh) = Ahli- hli) + sh fhal) 
Maa) = d -1) — AVR i(i- 1) 
BO = Ahh- 1) + sh (Hn (i - 1) 
mili) = em (rm) — AV?sh* (jg, (i — 1) 
qm (i) = A6 (i)gm(i-—1) + sh) 
xh) = AOE 
Tm+i(t) = rns (ivy? (i) 


Remark. As can be seen, the Givens-based lattice filter simply amounts to identifying explicitly the parameters 
of the rotation matrices, and subsequently expanding the equations of the QRD-based lattice filter. 


Problem X.8 (Minimum cost updates) In the Givens-based filter of Prob. X.7, explain why it 
is more convenient to update the quantities (cil ? (i), C2? (i)) as indicated in that problem rather 
than use the following updates, which follow from the QRD-based implementation (cf. Alg. 43.1): 


cu) = AP d.e? - 1) sha) fali) 
Wa) = XP dg -1) sx) ) 



Problem X.9 (Relation between RLS and lattice filters) The purpose of this problem is to 
establish a relation between lattice filters and the standard RLS solution (40.12). It turns out that 
there is a relation between the weight vector wm,; and the backward prediction vectors {w? ,) of 
increasing orders j — 0,1,..., M — 1. Iterate recursion (40.67) starting from the initial condition 
wo,i = 0 and up to j = M — 1, to establish that 


Ko (2) 
K1 (i) 
&a(i) 


KM -1(i) 
where the columns of the upper-triangular matrix are defined in terms of the backward prediction 
vectors 
b 
| x | , j=0,1,...,M —1 


Remark. This result shows that wm,; and the vector of reflection coefficients («; (2)) at time i determine each 
other uniquely. 


Problem X.10 (Backward prediction errors) Show that the backward prediction errors of suc- 
cessive orders (the ones appearing in Fig. 41.2) are related to the input sequence via the relation: 


[6G &G) ss bua ] = umi Uns 
where U m,i is the same upper triangular matrix appearing in Prob. X.9. 


Problem X.11 (State-space formulation of lattice filters) In Sec. 42.1 we explained that the
reflection coefficients $\{\kappa_M(i), \kappa^f_M(i), \kappa^b_M(i)\}$ can be interpreted as solutions to first-order
least-squares problems. Specifically, we introduced the angle-normalized vectors $\{b'_{M,i}, r'_{M,i}, f'_{M,i}, \bar b'_{M,i}\}$
and showed that $\{\kappa_M(i), \kappa^f_M(i), \kappa^b_M(i)\}$ are the solutions to the following three (regularized)
projection problems:

mings, [5 1A km]? + Ire = Kata]. = «m(i) 


l — y2 
minst, AIA hf? + ls. X sibus] = «i 


: Pare - 2 i 
ming, [IAR + [Bia eft] = nta) 
where 7j = AM? and jj = A?. That is, 


$\kappa_M(i)$ projects $r'_{M,i}$ onto $b'_{M,i}$
$\kappa^f_M(i)$ projects $f'_{M,i}$ onto $\bar b'_{M,i}$
$\kappa^b_M(i)$ projects $\bar b'_{M,i}$ onto $f'_{M,i}$


These interpretations were used in Sec. 42.1 and Chapter 43 to derive different forms of lattice filters 
by solving first-order least-squares problems. 

In this problem, as well as in Probs. X.12-X.13, we shall derive the same filters by appealing 
instead to the equivalence relation that exists between RLS and Kalman filtering. Thus recall from 
the discussion in App. 31.2 that we showed that the exponentially-weighted RLS algorithm can 
be obtained via equivalence to the Kalman filter as follows. We start from the state-space model 
(31.8) and write down the Kalman recursions for estimating its state variable. Then we translate the 
Kalman variables into the RLS variables by using the correspondences summarized in Table 31.2. In 
Probs. X.11-X.13, we want to use this equivalence construction in order to re-derive lattice filters. 
As a result, we shall clarify the connections between lattice filters and Kalman filtering. 


(a) Let us first consider the problem of projecting f4; ; onto bj, ; in order to evaluate «7, (i). Use 
the equivalence between RLS and Kalman filtering from App. 31.2 to argue that this problem 


can be solved by introducing the following one-dimensional state-space model: 


eit 1 =A Peti),  yf(i) = b (i)a (4) + vf (1) 
with E|z/(0)? = A715 and Ev/(i)v/*(j) = 2o and where the variables (2 (i), y (i)) 
are identified with the least-squares variables {kf ip du )) as follows: 


fia Fh (i) 
(2) Sy (VA)i 


f 
f x Ky 

T (i) a (VAi ’ 

(b) Consider next the problem of projecting bui onto fu in order to evaluate kb, (i). Argue that 

this problem can be solved by introducing the following one-dimensional state-space model: 

a^ (i +1) = A^? (i), fM (i)m (i) + v*(i) 

with E[z^(0)|? = A75 and Ev*(i)v*(5) = pu. and where the variables {x° (i), y*(i)] 
are identified with the Tus -squares variables (3 HT ^u (i)) as follows: 


pi — Ae 


z^(i — aye 


ES 
We 
(c) Consider now the problem of projecting r5, ; onto b^; ; in order to evaluate «m (i). Argue that 

this problem can be solved by introducing the following one-dimensional state-space model: 


z(ic-1)-A-V?sa(i, y(i) = by (i)m(i) + v(i) 


with E |z(0)/? = A7! z and Ev(i)v" (j) = ôij, and where the state-space variables {æ (i), y(i)} 
are identified with the pena variables {xm , rm (i)) as follows: 


m 
ai) — dA v — DS 


Remark. For further discussion on the connections between lattice filters and Kalman filters, see the article by 
Sayed and Kailath (1994b). 


Problem X.12 (Array lattice and Kalman filtering) Continue with the same setting as in Prob. X.11. 


(a) Consider the state-space model of part (a) of that problem, which relates to projecting fm, 
onto bj m,i» Use (35.36) to verify that the information array form of the corresponding Kalman 
filter is given by: 


AV29b/2 (iyi — 1) AU?2b (i) $?/?(i + 1|) 0. 
&^*Gi-ieM*ui-1)  wf*G) | Ole =| af Gt os Gt)  vi*()rz "^ () 
0 1 kè* (1) 5/2 (4 + 14) rz P?) 


where Of M, LIS any unitary rotation that introduces the zeros in the post-array and where we are 
using the notations (ó^(i), k Pi), v (i), v P) to denote the variables {®;, Kp,i, vi, Rei) 
that appear in the general formulation (35.36). Moreover, v^ (i) = y (i) — bh, (i) (i|i — 1). 
Use the correspondences from Table 31.2 to verify that the above array equations lead to the 
RLS array equations (43.14) for updating the forward estimation error. 

(b) Consider now the state-space model of part (b) of Prob. X.11, which relates to the problem of 
projecting bui onto fm,- Verify that the corresponding information array form is given by: 


AV29f/2(ili — 1) AU? fs (i) $f/2(i + 1|i) 0 
a^"(ii-1)9/?(ji-1)  y*() | Ole =| aatia irr A 


0 1 kÍ* (6f 2 (i + 1]i) re ^"? (i) 

where es; is any unitary rotation that introduces the zeros in the post-array, and where 
v* (i) = y*(i) — fA (i)& (ili — 1). Use the correspondences from Table 31.2 to verify that 
the above array equations lead to the RLS array equations (43.9) for updating the backward 
estimation error. 

(c) Finally, consider the state-space model of part (c) of Prob. X.11, which relates to the problem
of projecting riy, onto b; ;. Verify that the corresponding information array form is given 
by: 


AU29/2 (ili — 1) 1/255 (i) $^/? (4 + 12) 0 
Gli-1)9]2(i—1) — y*() Ox, | &'(i--i)eP 2i) wir) 
0 1 kb* (i)99/2 (i + 1\2) rz? (i) 


where Om, is any unitary rotation that introduces the zeros in the post-array, and where 
v(i) = y(t) - bu ()&(i|i — 1). Use the correspondences from Table 31.2 to verify that 
the above array equations lead to the RLS array equations (43.4) for updating the output 
estimation error. 


Remark. In other words, each of the array equations in the QR-based lattice algorithm is a special case of the 
array information form of the Kalman filter (Sayed and Kailath (1994b)). 


Problem X.13 (Error feedback and Kalman filtering) Continue with the same setting as in 
Prob. X.11. 


(a) Consider the state-space model of part (a) of that problem, which relates to projecting f M
onto b, ;. Verify that the corresponding Kalman recursion for the state estimator is given by: 


af (i 14) = A7? (ili — 1) + AV? p*( + 18) Be Dy! (2) — B DE! (4i — 1)] 
where the inverse of the Riccati variable p" (i) satisfies 
p (i+ 1i) = dp (i -1) + fw. pO- 1) = ag 


Now use the correspondences from Table 31.2 between RLS and Kalman filtering variables, as 
well as the relation between {bm (1), bm (i) ). to verify that the above state estimator equation 
leads to the following update for the reflection coefficient from Table 42.1: 


wh (i) = 02, (i — 1) + Bu Oam iamli) a (3) 


(b) Consider now the state-space model of part (b) of Prob. X.11, which relates to projecting bui
onto f4;.;. Verify that the corresponding Kalman recursion for the state estimator is given by: 


(i+ 1i) =A ji - 1) + PF E+ UD fry — fi Ge Gli - 1) 
where the inverse of the Riccati variable p’ (i) satisfies the recursion 
pc) = Ap lili- 1) + mOl, » (01-1 = a 


Now use the correspondences from Table 40.1 between RLS and Kalman filtering variables, as 
well as the relation between { fy, (i), fm (i)}, to verify that the above state estimator equation 
leads to the following update for the reflection coefficient from Table 42.1: 


ruli) = ruli- 1) + edu) GB li) GG) 


(c) Finally, consider the state-space model of part (c) of Prob. X.11, which relates to projecting
ri onto bh, ;. Verify that the corresponding Kalman recursion for the state estimator is 
given by: 


&(i + 1i) = A72 &(i[i — 1) + AV? p(i + 1i)by D[y(4) — Ong (i) (ii — 1) 


where the inverse of the Riccati variable p(i) satisfies the recursion
p^ 1) = Xp" Gli - 1) + GP, »7 0| 1) = ag?


Now use the correspondences from Table 40.1 between RLS and Kalman filtering variables, as 
well as the relation between {bhs (1), bm (1) ), to verify that the above state estimator equation 
leads to the following update for the reflection coefficient from Table 42.1: 


ka (i) 2 &u (i — 1) + Bu Oa (iem (3/6 3) 


Remark. In other words, each of the update equations for the reflection coefficients in error-feedback lattice forms 
is simply a special case of the prediction equation of the Kalman filter (cf. Sayed and Kailath (1994b)). 


COMPUTER PROJECT 


Project X.1 (Performance of lattice filters in finite precision) Although equivalent from a
theoretical point of view, the various lattice filters perform differently under finite-precision
conditions. The purpose of this computer project is to illustrate these differences, as well as to
illustrate the recovery mechanism of some of the filters during the occurrence of impulsive interferences.


(a) Generate 10 random coefficients of a channel and normalize its energy to unity. Feed unit-
variance Gaussian input data through the channel and add Gaussian noise to its output. Set
the noise power at 30 dB below the input signal power. Choose $\lambda = 0.999$ and $\eta = 10^{6}$
and train the following lattice filters, using the input sequence of the channel as input to the
lattice implementations and the noisy output of the channel as the reference sequence:
1. a posteriori lattice form; 2. a priori lattice form; 3. a priori lattice form with error feedback;
4. a posteriori lattice form with error feedback; 5. normalized lattice form; and 6. array lattice
form. Assume in your simulations a finite-precision implementation with B bits for signals,
including the sign bit; use the routine quantize.m from Computer Project IX.1. For each
algorithm, generate an ensemble-average learning curve by averaging over 50 experiments of
duration N = 200 iterations each for the following choices: 1. B = 35 bits; 2. B = 25 bits;
3. B = 20 bits; 4. B = 16 bits; and 5. B = 10 bits. Which lattice forms appear to be most
reliable in finite precision?


(b) For this part, assume first a floating-point implementation. Introduce an impulsive interference
of unit amplitude to the input sequence at time instant i = 200. Generate ensemble-average
learning curves for the lattice filters over N = 500 iterations and observe whether they recover
from the impulsive disturbance.

(c) Repeat the simulations of part (b) in finite precision using B = 20 bits and B = 10 bits.
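
The data-generation step of part (a) of the project can be sketched as follows. This is only an illustrative sketch in Python (the project itself refers to a MATLAB routine, quantize.m, which is not reproduced here); the function name make_channel_data, its parameters, and the use of a fixed random seed are assumptions made for this example and do not come from the text.

```python
import math
import random


def make_channel_data(num_taps=10, num_samples=200, snr_db=30.0, seed=0):
    """Hypothetical helper: unit-energy random channel, unit-variance Gaussian
    input, and a noisy channel output with noise snr_db below the input power."""
    rng = random.Random(seed)

    # Random channel coefficients, normalized so that their energy is one.
    taps = [rng.gauss(0.0, 1.0) for _ in range(num_taps)]
    norm = math.sqrt(sum(c * c for c in taps))
    taps = [c / norm for c in taps]

    # Unit-variance Gaussian input sequence.
    u = [rng.gauss(0.0, 1.0) for _ in range(num_samples)]

    # Noise power 30 dB below the (unit) input power means variance 10^(-3).
    noise_std = math.sqrt(10.0 ** (-snr_db / 10.0))

    # Reference sequence: channel output plus noise (zero initial conditions).
    d = []
    for i in range(num_samples):
        clean = sum(taps[k] * u[i - k] for k in range(num_taps) if i - k >= 0)
        d.append(clean + rng.gauss(0.0, noise_std))
    return taps, u, d
```

The sequences u and d would then play the roles of the input and reference sequences of each lattice implementation; quantization to B bits is not included in this sketch.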
Indefinite Least-Squares


We end our treatment of adaptive filtering in this book by studying the robustness of
adaptive filters in the presence of disturbances and uncertainties in the data. A study of 
this kind requires that we first define what we mean by robustness. For our purposes, 
and in loose terms, a robust filter will be one for which small disturbances in the data 
do not degrade the performance of the filter appreciably. The measure of smallness and 
largeness of a signal will be chosen as its energy, so that a robust filter will be one such 
that disturbances with small energy cannot lead to estimation errors with large energy and, 
more generally, the estimation error energy will remain bounded as long as the disturbance 
energy is bounded. 

There are of course other characterizations of robustness. The one described above 
lends itself to analysis and mathematical manipulations. In particular, its characterization 
will involve studying quadratic cost functions that share many of the characteristics of 
regularized least-squares, except for the appearance of indefinite weighting matrices (as 
opposed to positive-definite weighting matrices). For this reason, many of the features of 
least-squares solutions will manifest themselves again in this part of the book; albeit in 
modified forms that result from the presence of indefinite weights. 


44.1 INDEFINITE LEAST-SQUARES FORMULATION 


The notion of robust adaptive filters will be defined in mathematical terms in Secs. 45.1 
and 45.3. At that point, it will become clear that the design of robust filters rests on the 
minimization of certain regularized quadratic cost functions with indefinite weighting ma- 
trices. For this reason, in this section and in Sec. 44.2, we study such cost functions under 
rather general conditions. Then in Secs. 45.1 and 45.3 we specialize the ensuing theory to 
the design of robust filters. 

The indefinite least-squares problem that we study is a variation of the regularized least-
squares problem studied earlier in Sec. 29.8. Thus, given an $N \times 1$ measurement vector
$y$, an $N \times M$ data matrix $H$, an $N \times N$ Hermitian matrix $W$, and an $M \times M$ Hermitian
matrix $\Pi$, we consider the quadratic cost function

$$J(w) \triangleq w^*\Pi w + (y - Hw)^* W (y - Hw) \tag{44.1}$$

where $w$ is $M \times 1$. The matrix $W$ plays the role of a weighting matrix, except that now it
is not required to be positive-definite or even invertible any longer (it can have both pos-
itive and negative eigenvalues). This is in contrast to the weighted least-squares problem
(29.37) where $W$ was taken to be positive-definite. Likewise, the matrix $\Pi$ plays the role
of a regularization matrix, except that it too is not required to be positive-definite or even
invertible, as was the case with the regularized cost function in (29.37).
In Sec. 29.8, with positive-definite matrices $\{\Pi, W\}$, we found that $J(w)$ always had a
unique minimizing solution (cf. Thm. 29.5). However, when the matrices $\{\Pi, W\}$ are not
necessarily positive-definite, as is the case under consideration, other scenarios can occur
and the minimization problem may not even make sense. To explain what can happen in
this general case, we follow the same completion-of-squares argument that we employed
in Sec. 29.3. In this way, the reader will be able to appreciate the parallels between both
treatments.

We first rewrite $J(w)$ as

$$J(w) = \begin{bmatrix} y^* & w^* \end{bmatrix} \begin{bmatrix} W & -WH \\ -H^*W & \Pi + H^*WH \end{bmatrix} \begin{bmatrix} y \\ w \end{bmatrix} \tag{44.2}$$

and proceed to express it as the sum of two terms: one that depends on the unknown $w$
and one that is independent of $w$. The sum will allow us to examine the behavior of $J(w)$
in some detail. However, in order to present the main ideas without undue concern for
technicalities, we distinguish between two cases. We treat below the case of an invertible
coefficient matrix, $\Pi + H^*WH$, which is the case of interest to our treatment of robust
filters. Observe that the matrix $\Pi + H^*WH$ need not be invertible even when $H$ has full
column rank and $\{\Pi, W\}$ are both invertible. Consider, for example, the choices $\Pi = I$,
$H = \mathrm{col}\{I, I\}$, and $W = \mathrm{diag}\{I, -2I\}$, which lead to $\Pi + H^*WH = 0$. Such situations
can never arise when $\{\Pi, W\}$ are positive-definite since then $\Pi + H^*WH > 0$. For this
reason, there was no need to distinguish between singular and nonsingular coefficient ma-
trices in Sec. 29.8. Appendix 17.A of Sayed (2003) treats the general case of possibly
singular $\Pi + H^*WH$.
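
As a quick check of the example above, the following short computation (a worked sketch that uses only the stated choices, with all identity blocks taken to be of the same size) confirms that the coefficient matrix vanishes:

```latex
\Pi + H^*WH
 = I + \begin{bmatrix} I & I \end{bmatrix}
       \begin{bmatrix} I & 0 \\ 0 & -2I \end{bmatrix}
       \begin{bmatrix} I \\ I \end{bmatrix}
 = I + (I - 2I)
 = 0
```

Here $H$ has full column rank and both $\Pi$ and $W$ are invertible, yet the negative block in $W$ cancels the positive contributions exactly.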


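Invertible Coefficient Matrix
Assume $\Pi + H^*WH$ is invertible. Then the center matrix in (44.2) can be factored as the
product of upper-triangular, block-diagonal, and lower-triangular matrices:

$$\begin{bmatrix} W & -WH \\ -H^*W & \Pi + H^*WH \end{bmatrix} = \begin{bmatrix} I & -WH(\Pi + H^*WH)^{-1} \\ 0 & I \end{bmatrix} \begin{bmatrix} W - WH(\Pi + H^*WH)^{-1}H^*W & 0 \\ 0 & \Pi + H^*WH \end{bmatrix} \begin{bmatrix} I & 0 \\ -(\Pi + H^*WH)^{-1}H^*W & I \end{bmatrix} \tag{44.3}$$

Substituting the right-hand side into (44.2) and expanding leads to the representation:

$$J(w) = y^*\left[W - WH(\Pi + H^*WH)^{-1}H^*W\right]y + (w - \hat{w})^*(\Pi + H^*WH)(w - \hat{w}) \tag{44.4}$$

where we introduced the column vector

$$\hat{w} \triangleq (\Pi + H^*WH)^{-1}H^*Wy \tag{44.5}$$

In (44.4), $J(w)$ is expressed as the sum of two terms, and only the rightmost term de-
pends on the unknown $w$.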

If our aim is to minimize $J(w)$, then it does not follow any longer that the choice $w = \hat{w}$
minimizes $J(w)$, even though this choice for $w$ cancels the rightmost term in (44.4). This
is because the coefficient matrix $(\Pi + H^*WH)$ is in general indefinite, so that the choice
$w = \hat{w}$ could correspond to a minimizer of $J(w)$, a maximizer of $J(w)$, or neither, as we
explain below. Contrast this situation with the one we encountered in (29.15), where the
coefficient matrix was $H^*H$ and, hence, nonnegative-definite, and the choice $w = \hat{w}$ could
only be a minimizer.

In order to examine the nature of the solution (44.5), we consider the following three
possibilities for the coefficient matrix:


(a) $\Pi + H^*WH > 0$. In this case, the second term in (44.4) is nonnegative for all $w$,

$$(w - \hat{w})^*(\Pi + H^*WH)(w - \hat{w}) \ge 0$$

and it is zero only when $w = \hat{w}$. Consequently, from (44.4),

$$J(w) \ge y^*\left[W - WH(\Pi + H^*WH)^{-1}H^*W\right]y$$

with equality only when $w = \hat{w}$. This means that $J(w)$ attains its global minimum
at $w = \hat{w}$, so that the minimization problem below has a unique solution at $\hat{w}$:

$$\min_{w} J(w) \;\Longrightarrow\; \hat{w} = (\Pi + H^*WH)^{-1}H^*Wy \quad \text{when } \Pi + H^*WH > 0$$

Of course, saying that $\hat{w}$ is the minimizing solution of $J(w)$ means that if we start
at the point $w = \hat{w}$ and modify $w$ in any direction, the cost function $J(w)$ can only
increase in value. Figure 44.1 shows a typical plot of a quadratic cost function with
a global minimum.

FIGURE 44.1 A typical plot of a quadratic cost function $J(w)$ with a global minimum for the
case in which $w$ is two-dimensional, say, $w = \mathrm{col}\{\alpha, \beta\}$. The plot also shows the contour curves of
$J(w)$.

(b) $\Pi + H^*WH < 0$. In this case, the second term in (44.4) is non-positive for all $w$,

$$(w - \hat{w})^*(\Pi + H^*WH)(w - \hat{w}) \le 0$$

and it is zero only when $w = \hat{w}$. Consequently, from (44.4),

$$J(w) \le y^*\left[W - WH(\Pi + H^*WH)^{-1}H^*W\right]y$$

with equality only when $w = \hat{w}$. This means that $J(w)$ attains its global maximum
at $w = \hat{w}$, so that the maximization problem below has a unique solution at the
specified $\hat{w}$:

$$\max_{w} J(w) \;\Longrightarrow\; \hat{w} = (\Pi + H^*WH)^{-1}H^*Wy \quad \text{when } \Pi + H^*WH < 0$$

Again, saying that $\hat{w}$ is the maximizing solution of (44.1) means that if we start at
the point $w = \hat{w}$ and modify $w$ in any direction, the cost function $J(w)$ can only
decrease in value. Figure 44.2 shows a typical plot of a quadratic cost function with
a global maximum.

FIGURE 44.2 A typical plot of a quadratic cost function $J(w)$ with a global maximum for the
case in which $w$ is two-dimensional, say, $w = \mathrm{col}\{\alpha, \beta\}$. The plot also shows the contour curves of
$J(w)$.

(c) $\Pi + H^*WH$ is indefinite. In this case, the second term in (44.4) can be of any sign
(positive or negative). The term is still zero when $w = \hat{w}$ (but it can be zero at
other choices for $w$ as well; see Prob. XI.1). We say that $J(w)$ has a saddle point
at $w = \hat{w}$. A saddle point is such that if we depart from $w = \hat{w}$, then the cost
function increases along some directions and decreases along others. To explain
this behavior, introduce the eigen-decomposition $\Pi + H^*WH = V\Lambda V^*$, where
the diagonal entries of $\Lambda$, $\{\lambda_1, \lambda_2, \ldots, \lambda_M\}$, can be positive or negative, and $V$ is
a unitary matrix, i.e., $VV^* = V^*V = I$. Assume, for illustration purposes, that
the first diagonal entry of $\Lambda$ is positive with a corresponding eigenvector $v_1$, while
the last diagonal entry of $\Lambda$ is negative with a corresponding eigenvector $v_M$. Now
suppose we choose $w$ such that $(w - \hat{w}) = \alpha v_1$, for any nonzero scalar $\alpha$. This
means that starting from $\hat{w}$ we are modifying $w$ along the direction of $v_1$. Then the
second term in expression (44.4) for $J(w)$ evaluates to a positive value for all $\alpha \ne 0$,

$$(w - \hat{w})^*(\Pi + H^*WH)(w - \hat{w}) = |\alpha|^2\lambda_1 > 0$$

It follows that the cost function will increase along this direction. On the other hand,
if we choose $w$ such that $w = \hat{w} + \alpha v_M$, so that starting from $\hat{w}$ we are modifying
$w$ along the direction of $v_M$, then

$$(w - \hat{w})^*(\Pi + H^*WH)(w - \hat{w}) = |\alpha|^2\lambda_M < 0$$

which means that the cost function will decrease along this direction. Figure 44.3
provides an example of a quadratic cost function with a saddle point.

FIGURE 44.3 A typical plot of a quadratic cost function $J(w)$ with a saddle point for the case in
which $w$ is two-dimensional, say, $w = \mathrm{col}\{\alpha, \beta\}$. The plot also shows the contour curves of $J(w)$.


In all three cases (a)-(c) considered above, we say that $\hat{w}$ is the stationary point, also
called the critical point, of the cost function $J(w)$ (a stationary point is a point at which
the gradient of $J(w)$ with respect to $w$ evaluates to zero; see App. 44.A). We therefore
refer to the process of determining $\hat{w}$ as the process of stationarizing $J(w)$. In summary,
we arrive at the following statement.


Theorem 44.1 (Invertible coefficient matrix) Consider the problem of station-
arizing the quadratic cost $J(w)$ in (44.1), where $\{\Pi, W\}$ are Hermitian ma-
trices such that $\Pi + H^*WH$ is nonsingular. The following facts hold:

1. $J(w)$ has a unique stationary point at $\hat{w} = (\Pi + H^*WH)^{-1}H^*Wy$,
and
$$J(\hat{w}) = y^*\left[W - WH(\Pi + H^*WH)^{-1}H^*W\right]y$$

2. The vector $\hat{w}$ is a global minimum if, and only if, $\Pi + H^*WH > 0$.

3. The vector $\hat{w}$ is a global maximum if, and only if, $\Pi + H^*WH < 0$.

4. The stationary point is a saddle point, otherwise.
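
The statement of the theorem can be illustrated numerically in the simplest setting. The sketch below is only an illustration under stated assumptions: it takes real data, a scalar unknown $w$ (so $M = 1$), and a diagonal weighting matrix, in which case the coefficient $\Pi + H^*WH$ is a scalar and a saddle point cannot occur. The function names stationary_point and cost are hypothetical and chosen for this example.

```python
def stationary_point(pi, h, weights, y):
    """Scalar instance of (44.1): J(w) = pi*w^2 + sum_k weights[k]*(y[k] - h[k]*w)^2.

    Returns the stationary point, the cost there, and whether it is a
    minimum or a maximum (decided by the sign of Pi + H^T W H).
    """
    coeff = pi + sum(wk * hk * hk for wk, hk in zip(weights, h))    # Pi + H^T W H
    if coeff == 0:
        raise ValueError("coefficient Pi + H^T W H is singular")
    cross = sum(wk * hk * yk for wk, hk, yk in zip(weights, h, y))  # H^T W y
    w_hat = cross / coeff                                           # cf. (44.5)
    # Cost at the stationary point: y^T [W - W H (Pi + H^T W H)^{-1} H^T W] y
    j_hat = sum(wk * yk * yk for wk, yk in zip(weights, y)) - cross * cross / coeff
    kind = "minimum" if coeff > 0 else "maximum"
    return w_hat, j_hat, kind


def cost(pi, h, weights, y, w):
    """Direct evaluation of J(w) for comparison."""
    return pi * w * w + sum(wk * (yk - hk * w) ** 2
                            for wk, hk, yk in zip(weights, h, y))
```

For instance, with pi = 1, h = [1, 2], y = [1, 1], the indefinite weights [1, -0.1] still give a minimum (the coefficient is 1.6), whereas the weights [1, -1] give a maximum (the coefficient is -2).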
44.2 RECURSIVE MINIMIZATION ALGORITHM


Theorem 44.1 characterizes the stationarization of quadratic cost functions under mild con-
ditions on the matrices $\{\Pi, W\}$; these matrices are only required to be Hermitian but could
be singular. However, it is usually the case that $\{\Pi, W\}$ are invertible, i.e.,

$$\{\Pi, W\} \text{ are invertible Hermitian matrices} \tag{44.6}$$

In addition, since we are mostly interested in the case when the cost function $J(w)$ has a
unique minimizing solution, we shall require $\{\Pi, W\}$ to also satisfy

$$\Pi + H^*WH > 0 \tag{44.7}$$

For these reasons, from now on, we shall assume that conditions (44.6)-(44.7) hold; the
matrices $\{\Pi, W\}$ themselves could still be indefinite.

Our objective now is to develop a recursive algorithm that time-updates the minimizing 
solution of J(w), in a manner that is similar to the recursive least-squares (RLS) algorithm 
of Sec. 30.2. However, two complications arise in comparison to our treatment of RLS. 
First, we need to pay particular attention to the existence condition (44.7) in order to guar- 
antee that the successive iterates do indeed correspond to minima of the corresponding cost 
functions. Second, we need to derive a multi-channel version of the recursive algorithm 
(as we did for RLS in Prob. VII.36), whereby the measurements are taken as vectors rather 
than scalars. This generalization is needed in order for the resulting algorithm to be useful 
in the design of robust filters in Secs. 45.1 and 45.3. 


Derivation of the Algorithm 
So consider the quadratic cost function

$$J(w) = w^*\Pi w + (y - Hw)^*W(y - Hw) \tag{44.8}$$

with matrices $\{\Pi, W\}$ satisfying (44.6)-(44.7). We already know from Thm. 44.1 that,
under condition (44.7), the unique minimizing solution of $J(w)$ is given by

$$\hat{w} = (\Pi + H^*WH)^{-1}H^*Wy \tag{44.9}$$

We partition the vector $y$ into its individual components, $y = \mathrm{col}\{z_0, z_1, z_2, \ldots, z_{N-1}\}$,
where each $z_i$ could be a scalar or a vector, say, of dimension $p$. In the former case, $y$ is
$N \times 1$, while in the latter case, $y$ is $Np \times 1$. The reason why we allow for vector entries
$\{z_i\}$ in $y$ is that in our study of robust filters later in Secs. 45.1 and 45.3, the entries of
$y$ will need to be vectors.

Likewise, we partition $H$ into

$$H = \mathrm{col}\{U_0, U_1, U_2, \ldots, U_{N-1}\} \tag{44.10}$$

where each $U_i$ is $p \times M$. When $p = 1$, the $\{U_i\}$ become rows. We further assume that the
weighting matrix $W$ has a block diagonal structure, with $p \times p$ invertible diagonal entries,

$$W = \mathrm{diag}\{R_0, R_1, \ldots, R_{N-1}\} \tag{44.11}$$


In recursive minimization, we deal with the issue of increasing $N$. Therefore, in order
to indicate the dependency of the solution on $N$, we shall denote the vector $\hat{w}$ in (44.9) by
$w_{N-1}$; this notation is meant to indicate that $w_{N-1}$ is based on data up to time $N - 1$. We
shall also write $\{y_{N-1}, H_{N-1}, W_{N-1}, J_{N-1}(\cdot)\}$ instead of $\{y, H, W, J(\cdot)\}$, so that the
cost function (44.8) becomes

$$J_{N-1}(w) = w^*\Pi w + (y_{N-1} - H_{N-1}w)^*W_{N-1}(y_{N-1} - H_{N-1}w)$$

and, from (44.9), its minimizing solution is

$$w_{N-1} = (\Pi + H_{N-1}^*W_{N-1}H_{N-1})^{-1}H_{N-1}^*W_{N-1}y_{N-1} \tag{44.12}$$

provided that condition (44.7) holds, i.e.,

$$\Pi + H_{N-1}^*W_{N-1}H_{N-1} > 0 \tag{44.13}$$

Now suppose that one more (block) row is added to $H_{N-1}$, one more (vector) entry is
added to $y_{N-1}$, and one more invertible weighting block is added to $W_{N-1}$, leading to

$$y_N = \begin{bmatrix} y_{N-1} \\ z_N \end{bmatrix}, \qquad H_N = \begin{bmatrix} H_{N-1} \\ U_N \end{bmatrix}, \qquad W_N = \begin{bmatrix} W_{N-1} & \\ & R_N \end{bmatrix} \tag{44.14}$$

Then the minimizing solution of the extended cost function

$$J_N(w) \triangleq w^*\Pi w + (y_N - H_Nw)^*W_N(y_N - H_Nw)$$

is given by

$$w_N = (\Pi + H_N^*W_NH_N)^{-1}H_N^*W_Ny_N \tag{44.15}$$

provided that

$$\Pi + H_N^*W_NH_N > 0 \tag{44.16}$$

Our objective is to relate $w_N$ to $w_{N-1}$, as well as relate the existence conditions (44.13)
and (44.16).
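To do so, we start by introducing

$$P_N \triangleq (\Pi + H_N^*W_NH_N)^{-1} \quad \text{with} \quad P_{-1} = \Pi^{-1} \tag{44.17}$$

Then the existence conditions (44.13) and (44.16) can be restated in terms of $\{P_N, P_{N-1}\}$
as

$$P_N > 0 \quad \text{and} \quad P_{N-1} > 0 \tag{44.18}$$

In addition, the solutions $\{w_N, w_{N-1}\}$ can be expressed more compactly as

$$w_N = P_NH_N^*W_Ny_N, \qquad w_{N-1} = P_{N-1}H_{N-1}^*W_{N-1}y_{N-1} \tag{44.19}$$

Now note that the inverse of $P_N$ satisfies

$$P_N^{-1} = \Pi + H_N^*W_NH_N = \underbrace{\Pi + H_{N-1}^*W_{N-1}H_{N-1}}_{P_{N-1}^{-1}} + U_N^*R_NU_N$$

so that

$$P_N^{-1} = P_{N-1}^{-1} + U_N^*R_NU_N \tag{44.20}$$

Using the matrix inversion identity

$$(A + BCD)^{-1} = A^{-1} - A^{-1}B(C^{-1} + DA^{-1}B)^{-1}DA^{-1} \tag{44.21}$$

with the identifications $A = P_{N-1}^{-1}$, $B = U_N^*$, $C = R_N$, and $D = U_N$, we obtain

$$P_N = P_{N-1} - P_{N-1}U_N^*\Gamma_NU_NP_{N-1}, \qquad P_{-1} = \Pi^{-1} \tag{44.22}$$

where

$$\Gamma_N \triangleq (R_N^{-1} + U_NP_{N-1}U_N^*)^{-1} \tag{44.23}$$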


The above recursion for $P_N$ also gives one for updating $w_{N-1}$ to $w_N$. Using the expression
(44.19) for $w_N$, and substituting the recursion for $P_N$ into it, we obtain:

$$\begin{aligned}
w_N &= P_N\left[H_{N-1}^*W_{N-1}y_{N-1} + U_N^*R_Nz_N\right] \\
&= (P_{N-1} - P_{N-1}U_N^*\Gamma_NU_NP_{N-1})\left[H_{N-1}^*W_{N-1}y_{N-1} + U_N^*R_Nz_N\right] \\
&= P_{N-1}H_{N-1}^*W_{N-1}y_{N-1} - P_{N-1}U_N^*\Gamma_NU_NP_{N-1}H_{N-1}^*W_{N-1}y_{N-1} + P_{N-1}U_N^*(R_N - \Gamma_NU_NP_{N-1}U_N^*R_N)z_N \\
&= w_{N-1} - P_{N-1}U_N^*\Gamma_NU_Nw_{N-1} + P_{N-1}U_N^*\Gamma_N\underbrace{(\Gamma_N^{-1}R_N - U_NP_{N-1}U_N^*R_N)}_{I}z_N \\
&= w_{N-1} + P_{N-1}U_N^*\Gamma_N\left[z_N - U_Nw_{N-1}\right] \\
&= w_{N-1} + G_N\left[z_N - U_Nw_{N-1}\right]
\end{aligned}$$

where we are defining the gain matrix

$$G_N \triangleq P_{N-1}U_N^*\Gamma_N \tag{44.24}$$

In summary, we arrive at the following statement. Observe that when $p = 1$, so that the
$\{z_i\}$ are scalars and the $\{U_i\}$ are row vectors, and when $W_N = I$, the recursions stated
below have the same form as the RLS solution of Alg. 30.1.


Algorithm 44.1 (Recursive minimization algorithm) Consider an invertible
Hermitian matrix $\Pi$, and an invertible block diagonal weighting matrix $W_N =
\mathrm{diag}\{R_i\}$ with $p \times p$ block entries. The solution $w_N$ that minimizes the cost

$$J_N(w) = w^*\Pi w + (y_N - H_Nw)^*W_N(y_N - H_Nw)$$

can be computed recursively as follows. Start with $w_{-1} = 0$ and $P_{-1} = \Pi^{-1}$
and iterate for $i \ge 0$:

$$\begin{aligned}
\Gamma_i &= (R_i^{-1} + U_iP_{i-1}U_i^*)^{-1} \\
G_i &= P_{i-1}U_i^*\Gamma_i \\
w_i &= w_{i-1} + G_i\left[z_i - U_iw_{i-1}\right] \\
P_i &= P_{i-1} - G_i\Gamma_i^{-1}G_i^*
\end{aligned}$$

For each iteration $0 \le i \le N$, the vector $w_i$ minimizes the cost

$$J_i(w) = w^*\Pi w + (y_i - H_iw)^*W_i(y_i - H_iw)$$

if, and only if, $P_i > 0$.
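
The recursions of the algorithm can be sketched in code for the simplest case. The following is a minimal sketch, assuming real data, scalar measurements ($p = 1$, so that each $U_i$ is a row $u_i$ and each $R_i$ is a nonzero scalar $r_i$), and a diagonal $\Pi$. The function name recursive_minimize and its argument names are hypothetical; the sketch does not check the condition $P_i > 0$.

```python
def recursive_minimize(pi_diag, rows, z, r):
    """Recursions of Alg. 44.1 for p = 1 and real data (illustrative sketch).

    pi_diag : diagonal entries of Pi (nonzero, possibly negative)
    rows    : list of regressor rows u_i, each of length M
    z       : list of scalar measurements z_i
    r       : list of nonzero scalar weights R_i (possibly negative)
    Returns the final weight vector w_N and the matrix P_N.
    """
    M = len(pi_diag)
    w = [0.0] * M                                                # w_{-1} = 0
    P = [[(1.0 / pi_diag[a] if a == b else 0.0) for b in range(M)]
         for a in range(M)]                                      # P_{-1} = Pi^{-1}
    for u, zi, ri in zip(rows, z, r):
        Pu = [sum(P[a][b] * u[b] for b in range(M)) for a in range(M)]   # P_{i-1} u_i^T
        gamma = 1.0 / (1.0 / ri + sum(u[a] * Pu[a] for a in range(M)))   # Gamma_i
        g = [gamma * x for x in Pu]                                      # G_i
        e = zi - sum(u[a] * w[a] for a in range(M))                      # a priori error
        w = [w[a] + g[a] * e for a in range(M)]                          # w_i
        P = [[P[a][b] - g[a] * g[b] / gamma for b in range(M)]
             for a in range(M)]                                          # P_i
    return w, P
```

For example, with $\Pi = I_2$, rows $[1, 0]$, $[0, 1]$, $[1, 1]$, measurements $1, 2, 3$, and weights $1, 1, -0.25$, the recursion ends at $w = \mathrm{col}\{0.25, 0.75\}$, which agrees with the batch solution (44.9).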


Alternative Form 


We can derive two alternative expressions for the quantities $\{G_N, \Gamma_N\}$ in (44.23) and
(44.24). Multiplying the recursion for $P_N$ by $U_N^*$ from the right gives

$$\begin{aligned}
P_NU_N^* &= P_{N-1}U_N^* - P_{N-1}U_N^*\Gamma_NU_NP_{N-1}U_N^* \\
&= P_{N-1}U_N^*\Gamma_N\left[\Gamma_N^{-1} - U_NP_{N-1}U_N^*\right] \\
&= P_{N-1}U_N^*\Gamma_N\left[R_N^{-1} + U_NP_{N-1}U_N^* - U_NP_{N-1}U_N^*\right] \\
&= P_{N-1}U_N^*\Gamma_NR_N^{-1}
\end{aligned} \tag{44.25}$$

which leads to the following expression for $G_N$:

$$G_N = P_NU_N^*R_N \tag{44.26}$$

By further multiplying the identity (44.25) by $U_N$ from the left we get

$$U_NP_NU_N^*R_N = U_NP_{N-1}U_N^*\Gamma_N$$

But, from the definition (44.23) for $\Gamma_N$,

$$U_NP_{N-1}U_N^* = \Gamma_N^{-1} - R_N^{-1}$$

so that substituting into the previous equation we obtain the following alternative expres-
sion for $\Gamma_N$:

$$\Gamma_N = R_N - R_NU_NP_NU_N^*R_N \tag{44.27}$$

Expressions (44.26) and (44.27) are in terms of $P_N$ rather than $P_{N-1}$, so that the recursion
for $w_N$ from Alg. 44.1 can be rewritten as

$$w_N = w_{N-1} + P_NU_N^*R_N\left[z_N - U_Nw_{N-1}\right] \tag{44.28}$$


44.3 TIME-UPDATE OF THE MINIMUM COST 


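In addition to relating the solution vectors $\{w_N, w_{N-1}\}$, we can also relate the correspond-
ing minimum costs, $\{J_N(w_N), J_{N-1}(w_{N-1})\}$. The argument we employ here is similar
to the one that led to Lemma 30.1 in the RLS case. Thus, let $\xi(N) = J_N(w_N)$ denote the
minimum cost at time $N$. Let also $\{e_N, r_N\}$ denote the $p \times 1$ a priori and a posteriori
error vectors defined by

$$e_N \triangleq z_N - U_Nw_{N-1}, \qquad r_N \triangleq z_N - U_Nw_N \tag{44.29}$$

Then from Thm. 44.1 and Alg. 44.1 we have

$$\xi(N) = y_N^*W_N\left[y_N - H_Nw_N\right] \quad \text{and} \quad w_N = w_{N-1} + G_Ne_N$$

so that, using the partitioning (44.14) for $\{y_N, H_N, W_N\}$, we can rework the above ex-
pression for $\xi(N)$ and show that either of the following two expressions can be used to
time-update the minimum cost:

$$\xi(N) = \xi(N-1) + r_N^*R_Ne_N \tag{44.30}$$

$$\xi(N) = \xi(N-1) + e_N^*\Gamma_Ne_N \tag{44.31}$$

Indeed, we have

$$\begin{aligned}
\xi(N) &= \begin{bmatrix} y_{N-1}^*W_{N-1} & z_N^*R_N \end{bmatrix} \begin{bmatrix} y_{N-1} - H_{N-1}(w_{N-1} + G_Ne_N) \\ z_N - U_N(w_{N-1} + G_Ne_N) \end{bmatrix} \\
&= \begin{bmatrix} y_{N-1}^*W_{N-1} & z_N^*R_N \end{bmatrix} \begin{bmatrix} (y_{N-1} - H_{N-1}w_{N-1}) - H_{N-1}G_Ne_N \\ (I - U_NG_N)e_N \end{bmatrix} \\
&= \xi(N-1) - y_{N-1}^*W_{N-1}H_{N-1}G_Ne_N + z_N^*R_N\left[I - U_NG_N\right]e_N \\
&= \xi(N-1) + \left[z_N^*R_N - y_N^*W_NH_NG_N\right]e_N \\
&= \xi(N-1) + \Big[z_N^*R_N - \underbrace{y_N^*W_NH_NP_N}_{w_N^*}U_N^*R_N\Big]e_N \\
&= \xi(N-1) + (z_N^* - w_N^*U_N^*)R_Ne_N \\
&= \xi(N-1) + r_N^*R_Ne_N
\end{aligned}$$

which establishes (44.30). In order to derive (44.31), we relate $r_N$ and $e_N$ as follows:

$$\begin{aligned}
r_N &= z_N - U_Nw_N = z_N - U_N\left[w_{N-1} + G_Ne_N\right] \\
&= \left[I - U_NG_N\right]e_N \\
&= (I - U_NP_{N-1}U_N^*\Gamma_N)e_N \\
&= \left[\Gamma_N^{-1} - U_NP_{N-1}U_N^*\right]\Gamma_Ne_N \\
&= R_N^{-1}\Gamma_Ne_N
\end{aligned}$$

That is, $r_N = R_N^{-1}\Gamma_Ne_N$, so that $r_N^*R_Ne_N = e_N^*\Gamma_Ne_N$. Substituting into (44.30) leads to (44.31).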


Lemma 44.1 (Estimation errors) Consider the same setting of Alg. 44.1. At
each iteration $i$, the a priori and a posteriori errors defined by $e_i = z_i - U_iw_{i-1}$
and $r_i = z_i - U_iw_i$ are related via $r_i = R_i^{-1}\Gamma_ie_i$. In addition, the minimum
cost, $\xi(i) = J_i(w_i)$, can be time-updated via either recursion:

$$\xi(i) = \xi(i-1) + r_i^*R_ie_i = \xi(i-1) + e_i^*\Gamma_ie_i$$

Using the initial condition $\xi(-1) = 0$, it also follows that the minimum cost
at time $N$ is given by

$$\xi(N) = \sum_{i=0}^{N} e_i^*\Gamma_ie_i$$
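
The relations in the lemma can be checked numerically. The sketch below is an illustration only, assuming real scalar data with $M = 1$ and $p = 1$; the function names min_cost_updates and direct_cost are hypothetical. It accumulates $e_i\Gamma_ie_i$ along the recursion and records both errors so that $r_i = R_i^{-1}\Gamma_ie_i$ can be verified.

```python
def min_cost_updates(pi, u, z, r):
    """Scalar (M = 1, p = 1) recursion that also tracks the minimum cost xi(i)."""
    w, p, xi = 0.0, 1.0 / pi, 0.0        # w_{-1} = 0, P_{-1} = 1/Pi, xi(-1) = 0
    history = []
    for ui, zi, ri in zip(u, z, r):
        gamma = 1.0 / (1.0 / ri + ui * p * ui)   # Gamma_i
        e = zi - ui * w                          # a priori error e_i
        w = w + p * ui * gamma * e               # w_i = w_{i-1} + G_i e_i
        p = p - (p * ui) * gamma * (ui * p)      # P_i
        resid = zi - ui * w                      # a posteriori error r_i
        xi += e * gamma * e                      # xi(i) = xi(i-1) + e_i Gamma_i e_i
        history.append((e, resid, gamma, xi))
    return w, history


def direct_cost(pi, u, z, r, w):
    """Direct evaluation of J(w) = Pi w^2 + sum_i R_i (z_i - u_i w)^2."""
    return pi * w * w + sum(ri * (zi - ui * w) ** 2 for ui, zi, ri in zip(u, z, r))
```

With pi = 1, u = [1, 2], z = [1, 2], and r = [1, -0.1], the recursion ends at w = 0.375 with accumulated cost 0.375, which matches the direct evaluation of the cost at that point.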


44.4 SINGULAR WEIGHTING MATRICES


Although the derivations so far assumed invertible matrices $\{\Pi, W\}$, with a block diago-
nal $W$, the results can be extended in order to accommodate a possibly singular $W$. We
illustrate this possibility by considering one particular special case, which will arise in our
study of robust filters in Sec. 45.3.

Assume all matrices $R_i$ in (44.11) are invertible for $0 \le i \le N - 1$, while $R_N$ in (44.14)
is singular, say, of the particular form

$$R_N = \begin{bmatrix} A_N & \\ & 0 \end{bmatrix}$$

with an invertible leading submatrix $A_N$. In other words, assume $W_N$ has the form

$$W_N = \begin{bmatrix} W_{N-1} & & \\ & A_N & \\ & & 0 \end{bmatrix}$$

with invertible $\{W_{N-1}, A_N\}$. The question is how does the new minimizing solution $w_N$
relate to $w_{N-1}$, and how do the minimum costs relate to each other?
To answer these questions, we partition the entries $\{U_N, z_N\}$ in (44.14) accordingly
with $R_N$, say,

$$U_N = \begin{bmatrix} L_N \\ \times \end{bmatrix}, \qquad z_N = \begin{bmatrix} s_N \\ \times \end{bmatrix}$$

Then, by repeating the derivation of Sec. 44.2, it is easy to see that the recursions of
Alg. 44.1 still hold for the time instants $0 \le i \le N - 1$, i.e., it still holds that

$$\begin{aligned}
\Gamma_i &= (R_i^{-1} + U_iP_{i-1}U_i^*)^{-1} \\
G_i &= P_{i-1}U_i^*\Gamma_i \\
w_i &= w_{i-1} + G_i\left[z_i - U_iw_{i-1}\right] \\
P_i &= P_{i-1} - G_i\Gamma_i^{-1}G_i^*
\end{aligned}$$

while the iteration for going from time $N - 1$ to time $N$ is now given by

$$\begin{aligned}
\Gamma_N &= (A_N^{-1} + L_NP_{N-1}L_N^*)^{-1} \\
G_N &= P_{N-1}L_N^*\Gamma_N \\
w_N &= w_{N-1} + G_N\left[s_N - L_Nw_{N-1}\right] \\
P_N &= P_{N-1} - G_N\Gamma_N^{-1}G_N^*
\end{aligned}$$

with $z_N$ replaced by $s_N$, $R_N$ replaced by $A_N$, and $U_N$ replaced by $L_N$. Moreover, for any
$0 \le i \le N$, it still holds that $w_i$ is a minimum of the corresponding cost function $J_i(w)$
if, and only if, $P_i > 0$. In particular, $w_N$ minimizes $J_N(w)$ if, and only if, $P_N > 0$. But
since now

$$P_N^{-1} = P_{N-1}^{-1} + L_N^*A_NL_N$$

we find that the condition at time $N$ can be equivalently stated as requiring

$$P_{N-1}^{-1} + L_N^*A_NL_N > 0 \tag{44.33}$$


In addition, the minimum value of $J_N(w)$ is given by either of the following expressions:

$$\xi(N) = \sum_{i=0}^{N} e_i^*\Gamma_ie_i = \sum_{i=0}^{N} r_i^*R_ie_i \tag{44.34}$$

where

$$\begin{aligned}
e_i &= z_i - U_iw_{i-1} \quad \text{for } 0 \le i \le N-1 \\
r_i &= z_i - U_iw_i \quad \text{for } 0 \le i \le N-1 \\
e_N &= s_N - L_Nw_{N-1} \\
r_N &= s_N - L_Nw_N
\end{aligned} \tag{44.35}$$

and where $A_N$ is used in place of $R_N$ in the term $i = N$ of the second sum.

We are now in a position to use the theory of indefinite least-squares, as developed so far,
to design robust filters.


44.A APPENDIX: STATIONARY POINTS 


Let J(w) denote a function of w that is not necessarily quadratic. 


Definition 44.1 (Stationary points) A stationary point (also called a critical point)
of $J(w)$ is defined as any $\hat{w}$ at which the gradient vector is annihilated:

$$\hat{w} \text{ is a stationary point} \;\Longleftrightarrow\; \nabla_w J(w)\big|_{w = \hat{w}} = 0$$

Moreover, the following facts hold:

1. A stationary point $\hat{w}$ is a local minimum if $\nabla_w^2[J(\hat{w})] > 0$.

2. A stationary point $\hat{w}$ is a local maximum if $\nabla_w^2[J(\hat{w})] < 0$.

3. A stationary point $\hat{w}$ is a saddle point if $\nabla_w^2[J(\hat{w})]$ is indefinite and invertible.
In this case, the behavior of the function at the saddle point looks similar to
the shape of a saddle (hence, the name).

4. If the Hessian matrix is singular, no conclusion can be drawn about the
stationary point (i.e., whether it is a minimum, a maximum, or a saddle
point). Further analysis is required.


The discussion in Sec. 44.1 shows that when J(w) is quadratic in w, as in (44.1), then more can be 
said about its stationary points. 


Definition 44.2 (Stationary points of quadratic functions) When the cost func-
tion $J(w)$ is quadratic in $w$, the following facts hold:

1. A stationary point $\hat{w}$ is a global minimum if, and only if, $\nabla_w^2[J(\hat{w})] > 0$.

2. A stationary point $\hat{w}$ is a global maximum if $\nabla_w^2[J(\hat{w})] < 0$.

3. Otherwise, a stationary point $\hat{w}$ is a saddle point.


44.B APPENDIX: INERTIA CONDITIONS 


We stated in Alg. 44.1 that each $w_N$ will be a minimum of $J_N(w)$ only if the corresponding matrix
$P_N$ is positive-definite. This positivity condition needs to be checked at each iteration in order to en-
sure that the successive minimization problems are well-defined. Checking the positive-definiteness
of $P_N$, for every $N$, is computationally demanding, but simplifications are possible, as we now verify
by induction.

Assume $P_i > 0$ for $0 \le i \le N - 1$ and let us devise a simplified condition for checking whether
$P_N > 0$. We start by recalling that $P_N = P_{N-1} - P_{N-1}U_N^*\Gamma_NU_NP_{N-1}$, which shows that $P_N$
is the Schur complement of the block matrix

$$X \triangleq \begin{bmatrix} P_{N-1} & P_{N-1}U_N^* \\ U_NP_{N-1} & \Gamma_N^{-1} \end{bmatrix}$$

with respect to its lower-right corner entry. This observation can be used as the basis for an alternative
method for checking the positive-definiteness of the matrices $\{P_{N-1}, P_N\}$. To see this, we factor $X$
in two ways:

$$X = \begin{bmatrix} I & G_N \\ 0 & I \end{bmatrix} \begin{bmatrix} P_N & \\ & \Gamma_N^{-1} \end{bmatrix} \begin{bmatrix} I & 0 \\ G_N^* & I \end{bmatrix} \tag{44.36}$$

$$X = \begin{bmatrix} I & 0 \\ U_N & I \end{bmatrix} \begin{bmatrix} P_{N-1} & \\ & R_N^{-1} \end{bmatrix} \begin{bmatrix} I & U_N^* \\ 0 & I \end{bmatrix} \tag{44.37}$$

Each of the triangular factorizations (44.36) and (44.37) has the form of a congruence relation. In
other words, each one of them has the form $X = CTC^*$ for some nonsingular matrix $C$ and center
matrix $T$.

Now recall that Sylvester's law of inertia (cf. Lemma B.5) states that the inertia of a Hermitian
matrix is preserved under congruence, where the inertia of a Hermitian matrix $T$ is defined as the
triplet of integers $\{I_+(T), I_-(T), I_0(T)\}$, such that

$$I_{+,-,0}(T) \triangleq \text{number of (positive, negative, zero) eigenvalues of } T \tag{44.38}$$

Applying this result to the factorizations (44.36) and (44.37) we find that the matrices $\mathrm{diag}\{P_N, \Gamma_N^{-1}\}$
and $\mathrm{diag}\{P_{N-1}, R_N^{-1}\}$ should have the same inertia, or more explicitly,

$$\begin{aligned}
I_+(P_N \oplus \Gamma_N^{-1}) &= I_+(P_{N-1} \oplus R_N^{-1}) \\
I_-(P_N \oplus \Gamma_N^{-1}) &= I_-(P_{N-1} \oplus R_N^{-1})
\end{aligned} \tag{44.39}$$

where the notation $a \oplus b$ denotes a block diagonal matrix with entries $\{a, b\}$, i.e., $a \oplus b = \mathrm{diag}\{a, b\}$.
Using the rather obvious fact that the inertia of a block diagonal matrix is the sum of the inertias of
its individual diagonal blocks, it follows from (44.39) that the following inertia equalities must hold:

$$\begin{aligned}
I_+(P_N) + I_+(\Gamma_N^{-1}) &= I_+(P_{N-1}) + I_+(R_N^{-1}) \\
I_-(P_N) + I_-(\Gamma_N^{-1}) &= I_-(P_{N-1}) + I_-(R_N^{-1})
\end{aligned}$$

From this result, and from $P_{N-1} > 0$, we find that $P_N > 0$ will hold if, and only if,

$$I_+(\Gamma_N^{-1}) = I_+(R_N^{-1}) \quad \text{and} \quad I_-(\Gamma_N^{-1}) = I_-(R_N^{-1})$$

In other words, since a matrix and its inverse have the same inertia, we find that $P_{N-1} > 0$ and
$P_N > 0$ will hold if, and only if, the matrices

$$\{\Gamma_N, R_N\} \text{ have the same inertia} \tag{44.40}$$


Lemma 44.2 (Minimization conditions) Consider the same setting of Alg. 44.1.
Then each $w_i$ is a minimizer of the corresponding cost function $J_i(\cdot)$, for $i =
0, 1, \ldots, N$, if, and only if, $P_i > 0$ for $i = 0, 1, \ldots, N$ or, equivalently, if and only
if, $\{\Gamma_i, R_i\}$ have the same inertia for $i = 0, 1, \ldots, N$.


The second condition in the lemma is easier to check since the matrices $\{\Gamma_i\}$ are smaller in size
than the $\{P_i\}$; the former is $p \times p$ while the latter is $M \times M$, and usually $p \ll M$. For example, in
the special case $p = 1$, we have that $P_i$ is $M \times M$ while $\{\Gamma_i, R_i\}$ are scalars. In the scalar case,
requiring the $\{\Gamma_i, R_i\}$ to have the same inertia is equivalent to requiring them to have the same sign.
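
The scalar form of the test can be sketched as follows. This is an illustrative sketch only, assuming real scalar data with $M = 1$ and $p = 1$ and a positive $\Pi$ (so that $P_{-1} > 0$); the function name inertia_check is hypothetical. At each step it reports whether $\Gamma_i$ and $R_i$ have the same sign, next to the direct check $P_i > 0$, so that the two conditions can be compared.

```python
def inertia_check(pi, u, r):
    """For each step, return (Gamma_i and R_i have the same sign, P_i > 0).

    The two flags agree as long as all earlier P_j were positive, which is
    the inductive hypothesis used in the argument above.
    """
    p = 1.0 / pi                                  # P_{-1} = 1/Pi, assumed positive
    flags = []
    for ui, ri in zip(u, r):
        gamma = 1.0 / (1.0 / ri + ui * p * ui)    # scalar Gamma_i
        p = p - (p * ui) * gamma * (ui * p)       # scalar P_i
        flags.append(((gamma > 0) == (ri > 0), p > 0))
    return flags
```

For example, with pi = 1 and a single measurement with u = 1, the weight r = -2 gives $\Gamma = 2$ and $P = -1$, so the sign test fails and the stationary point is not a minimum, consistent with $\Pi + H^*WH = 1 - 2 < 0$.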




Robust Adaptive Filters 


We now formulate a robust design criterion and proceed to devise adaptive filters that
meet the desired robustness performance. The derivations and arguments in this chapter
are similar in nature to the arguments we employed in Chapters 29 and 30 while study-
ing recursive least-squares problems. The main distinction will be in the use of quadratic
cost functions with indefinite weighting matrices, as opposed to positive-definite weights.
A key conclusion from the discussion here will be that some of the adaptive filters that
we encountered before, e.g., LMS and $\epsilon$-NLMS, and which were derived in Chapters 10
and 11 by appealing to stochastic-gradient approximations, will now be shown to satisfy
the adopted robustness measure. Actually, the arguments in the current chapter will al-
low us to motivate and derive these algorithms as optimal, as opposed to approximate,
recursive solutions to well-defined optimization problems, in much the same way as the
RLS algorithm was derived in Chapter 30 as the optimal recursive solution to a regularized
least-squares problem.


45.1 A POSTERIORI-BASED ROBUST FILTERS 


Thus, consider measurements $\{d(i)\}$ that are related to an unknown vector $w^o$ via

$$
d(i) = u_i w^o + v(i) \qquad (45.1)
$$

where $v(i)$ denotes an unknown disturbance and $u_i$ is a row vector. The disturbance sequence
is assumed to have finite energy, so that

$$
\sum_{j=0}^{i} |v(j)|^2 < \infty \quad \text{for all } i \qquad (45.2)
$$

Given the $\{d(i)\}$, we are interested in estimating some linear transformation of $w^o$, say,

$$
s_i \triangleq L_i w^o \qquad (45.3)
$$

for some given matrices $L_i$. For example, if $L_i$ is chosen as the identity matrix, $L_i = I$,
then the problem amounts to that of estimating $w^o$ itself. If, on the other hand, $L_i = u_i$,
then the problem amounts to estimating $u_i w^o$, which is the uncorrupted part of $d(i)$.
Different choices for $L_i$ lead to different interpretations for $s_i$. The derivation that follows
is for arbitrary $L_i$.

Clearly, whenever we say that the observations $\{d(i)\}$ satisfy a model of the form (45.1),
there is the implicit assumption that such a $w^o$ exists and, furthermore, that we know its
dimension. However, it is not hard to imagine that the assumption need not be valid in
general and, therefore, modeling mismatches are bound to occur. Consider, for example, a
first-order auto-regressive model with transfer function

$$
H(z) = \frac{1}{1 - \alpha z^{-1}}, \quad \text{with } \alpha \text{ real and } |\alpha| < 1
$$

so that

$$
H(z) = 1 + \alpha z^{-1} + \alpha^2 z^{-2} + \alpha^3 z^{-3} + \ldots \qquad (45.4)
$$

Assume further that we feed a sequence $\{u(i)\}$ into this model and measure its output
sequence, say, $\{o(i)\}$, in the presence of additive noise $\{n(i)\}$, i.e., assume we measure

$$
d(i) = o(i) + n(i)
$$

If we choose some FIR model of order $M$ to approximate $H(z)$, then we are in effect
assuming that $o(i) \approx u_i w^o$ where, for this example,

$$
u_i = \begin{bmatrix} u(i) & u(i-1) & \ldots & u(i-M+1) \end{bmatrix}
$$

and $w^o$ could consist of the first $M$ coefficients of the expansion (45.4), i.e.,

$$
w^o = \mathrm{col}\{1, \alpha, \alpha^2, \ldots, \alpha^{M-1}\}
$$

The mismatch between $o(i)$ and $u_i w^o$ is due to the terms that are ignored from the expansion
(45.4). If we incorporate the effect of these ignored terms into $n(i)$, then we end up
with a relation of the same form as in (45.1), i.e.,

$$
d(i) = u_i w^o + v(i)
$$

with the term $v(i)$ accounting for both measurement noise and modeling uncertainties or
errors. It is for this reason that we sometimes state that $v(i)$ in the model (45.1) represents
not only measurement noise but also model uncertainties.
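
The truncation argument above can be illustrated numerically. The following is only a minimal sketch under stated assumptions: real-valued data, a zero initial state, a unit-impulse input, and the illustrative values α = 0.5 and M = 4; the function names `ar1_output` and `fir_output` are hypothetical helpers and not part of the text. It computes the part of the output of H(z) that an order-M FIR model cannot reproduce, i.e., the modeling error that gets absorbed into v(i).

```python
def ar1_output(u, alpha):
    """Output of H(z) = 1/(1 - alpha z^-1): o(i) = alpha*o(i-1) + u(i), zero initial state."""
    out, prev = [], 0.0
    for x in u:
        prev = alpha * prev + x
        out.append(prev)
    return out


def fir_output(u, w):
    """FIR output u_i w with u_i = [u(i) u(i-1) ... u(i-M+1)], taking u(k) = 0 for k < 0."""
    return [sum(w[k] * u[i - k] for k in range(len(w)) if i - k >= 0)
            for i in range(len(u))]


alpha, M = 0.5, 4                        # illustrative values (assumptions)
w_o = [alpha ** k for k in range(M)]     # col{1, alpha, ..., alpha^(M-1)}
u = [1.0] + [0.0] * 9                    # unit-impulse input
o = ar1_output(u, alpha)
# Modeling error o(i) - u_i w_o: zero for i < M, then the ignored tail alpha^i.
mismatch = [oi - fi for oi, fi in zip(o, fir_output(u, w_o))]
print(mismatch)
```

For an impulse input the mismatch is exactly the tail of the expansion (45.4) beyond the first M terms, which is the contribution that the model (45.1) lumps into the disturbance.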

Besides model uncertainties, a second source of uncertainty that is relevant in the context
of adaptive filtering pertains to the choice of the initial condition $w_{-1}$ of an adaptive
algorithm. The value of $w_{-1}$ can interfere with the performance of the filter, e.g., it may
delay convergence or even cause divergence. Usually, we choose $w_{-1} = 0$ and, therefore,
the squared Euclidean norm of $w^o$, $\|w^o\|^2$, serves as a measure of how far this initial guess
is from $w^o$. In our formulation of robust filters we shall also account for the effect of the
initial condition on filter performance.


Robustness Criterion 


Now let $\hat{s}_{i|i}$ denote an estimate for $s_i$ that is based causally on the measurements $\{d(j)\}$,
i.e., $\hat{s}_{i|i}$ can only depend on $\{d(0), d(1), \ldots, d(i)\}$ and not on any future value of $d(\cdot)$. Our
objective is to compute estimates $\{\hat{s}_{i|i}\}$ that are robust according to the following criterion.

Let $\gamma$ be a given positive number and let $\Pi$ be a given positive-definite matrix. We want
to determine estimates

$$
\{\hat{s}_{0|0}, \hat{s}_{1|1}, \hat{s}_{2|2}, \ldots, \hat{s}_{N|N}\}
$$

such that

$$
\frac{\sum_{j=0}^{i} \|\hat{s}_{j|j} - s_j\|^2}{w^{o*}\Pi w^o + \sum_{j=0}^{i} |v(j)|^2} < \gamma^2 \quad \text{for all } i = 0, 1, \ldots, N \qquad (45.5)
$$


This requirement has the following interpretation. For every time instant $i$, the numerator
is a measure of the estimation-error energy up to time $i$, while the denominator consists of
two terms:


1. One term is the energy of the disturbance, $v(j)$, over the same period of time.


2. The other term, $w^{o*}\Pi w^o$, is the weighted energy of the error in estimating $w^o$ by
using, for the lack of any other information, a zero initial guess for it. That is,
$w^{o*}\Pi w^o = \tilde{w}_{-1}^{*}\Pi\tilde{w}_{-1}$ where $\tilde{w}_{-1} = w^o - w_{-1} = w^o$.


We therefore say that the denominator in (45.5) is a measure of the energies of the disturbances
in the problem, namely, $\{v(\cdot), w^o\}$, while the numerator is a measure of the resulting
estimation error energies. In this way, the criterion (45.5) requires that we determine estimates
$\{\hat{s}_{j|j}\}$ such that the ratio of energies, from disturbances to estimation errors, does
not exceed $\gamma^2$ for any $\{v(\cdot), w^o\}$. When this property holds, we shall say that the resulting
estimates are robust in the sense that bounded disturbance energies lead to bounded
estimation-error energies and, in a similar vein, small disturbance energies lead to small
estimation-error energies. Of course, the smaller the value of $\gamma$, the more robust the solution
is.$^{23}$ Observe further that no statistical assumptions on the data are being made. In this
way, the robustness studies that are carried out in this chapter allow us to highlight some
features of adaptive algorithms that hold regardless of statistical considerations.


Relation to Indefinite Least-Squares 
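The condition (45.5) can be equivalently stated as requiring, for $i = 0, 1, \ldots, N$,

$$
w^{o*}\Pi w^o + \sum_{j=0}^{i} |v(j)|^2 - \gamma^{-2} \sum_{j=0}^{i} \|\hat{s}_{j|j} - s_j\|^2 > 0
$$

no matter what the value of the unknown $w^o$ is, as well as the value of the disturbance
sequence $\{v(j)\}$. Using the relation

$$
d(j) = u_j w^o + v(j)
$$

we can equivalently rewrite the above requirement as

$$
w^{o*}\Pi w^o + \sum_{j=0}^{i}
\left( \begin{bmatrix} \hat{s}_{j|j} \\ d(j) \end{bmatrix} - \begin{bmatrix} L_j \\ u_j \end{bmatrix} w^o \right)^{*}
\begin{bmatrix} -\gamma^{-2} I & 0 \\ 0 & 1 \end{bmatrix}
\left( \begin{bmatrix} \hat{s}_{j|j} \\ d(j) \end{bmatrix} - \begin{bmatrix} L_j \\ u_j \end{bmatrix} w^o \right) > 0 \qquad (45.6)
$$

regardless of $w^o$. In other words, the problem of determining estimates $\{\hat{s}_{j|j}\}$ in order to
satisfy (45.5), for any $w^o$ and for any $\{v(\cdot)\}$ satisfying (45.2), is analogous to the problem
of determining $\{\hat{s}_{j|j}\}$ in order to satisfy (45.6) for any $w^o$.

$^{23}$ However, the value of $\gamma$ cannot be reduced at will by the designer. As we are going to see, its value needs to
meet certain existence conditions (see, e.g., (45.15)).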

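If we now introduce the vector and matrix quantities

$$
y_i \triangleq \begin{bmatrix} \hat{s}_{0|0} \\ d(0) \\ \hat{s}_{1|1} \\ d(1) \\ \vdots \\ \hat{s}_{i|i} \\ d(i) \end{bmatrix}, \qquad
H_i \triangleq \begin{bmatrix} L_0 \\ u_0 \\ L_1 \\ u_1 \\ \vdots \\ L_i \\ u_i \end{bmatrix} \qquad (45.7)
$$

as well as the block-diagonal weighting matrix, with $i + 1$ identical diagonal blocks,

$$
W_i \triangleq \begin{bmatrix} -\gamma^{-2} I & 0 \\ 0 & 1 \end{bmatrix} \oplus \begin{bmatrix} -\gamma^{-2} I & 0 \\ 0 & 1 \end{bmatrix} \oplus \cdots \oplus \begin{bmatrix} -\gamma^{-2} I & 0 \\ 0 & 1 \end{bmatrix} \qquad (45.8)
$$

then problem (45.6) amounts to determining estimates $\{\hat{s}_{j|j}\}$ that guarantee $J_i(w) > 0$ for
$i = 0, 1, \ldots, N$ and for any $w$, where the cost function $J_i(w)$ is defined by

$$
J_i(w) = w^{*}\Pi w + (y_i - H_i w)^{*} W_i (y_i - H_i w) \qquad (45.9)
$$

and where we are denoting the indeterminate variable more generally by $w$ instead of $w^o$.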

The cost function $J_i(w)$ so defined is quadratic in $w$ and has an indefinite weighting
matrix $W_i$; it has the same form as the cost function $J(w)$ studied in (44.1). Therefore, the
results developed in Secs. 44.1-44.2 are immediately applicable. Specifically, in order for
the quadratic function $J_i(w)$ to be positive for all $w$, it is necessary and sufficient that both
of the following conditions hold:


1. $J_i(w)$ should have a minimum with respect to $w$.


2. The estimates $\{\hat{s}_{j|j}\}$ should be chosen such that the value of $J_i(w)$ at its minimum
is positive.


Solving the Minimization Step 

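Let us first show how $J_i(w)$ can be guaranteed to have a minimum for each $i = 0, 1, \ldots, N$,
and, in addition, how to update the minimizing solutions. Comparing expression (45.9) for
$J_i(w)$ with the expression for $J_i(w)$ from Alg. 44.1, we can make the following identifications:

$$
y_i \longleftarrow \begin{bmatrix} \hat{s}_{i|i} \\ d(i) \end{bmatrix}, \qquad
H_i \longleftarrow \begin{bmatrix} L_i \\ u_i \end{bmatrix}, \qquad
R_i \longleftarrow \begin{bmatrix} -\gamma^{2} I & 0 \\ 0 & 1 \end{bmatrix} \qquad (45.10)
$$

so that the algorithm that updates the successive minima is given by the following equations.
Start with $w_{-1} = 0$ and $P_{-1} = \Pi^{-1}$ and iterate

$$
\Gamma_i = \begin{bmatrix} -\gamma^{2} I & 0 \\ 0 & 1 \end{bmatrix} + \begin{bmatrix} L_i \\ u_i \end{bmatrix} P_{i-1} \begin{bmatrix} L_i^{*} & u_i^{*} \end{bmatrix} \qquad (45.11)
$$

$$
G_i = P_{i-1} \begin{bmatrix} L_i^{*} & u_i^{*} \end{bmatrix} \Gamma_i^{-1} \qquad (45.12)
$$

$$
w_i = w_{i-1} + G_i \left( \begin{bmatrix} \hat{s}_{i|i} \\ d(i) \end{bmatrix} - \begin{bmatrix} L_i \\ u_i \end{bmatrix} w_{i-1} \right) \qquad (45.13)
$$

$$
P_i = P_{i-1} - G_i \Gamma_i G_i^{*} \qquad (45.14)
$$

Each $w_i$ is a minimum of the corresponding cost function $J_i(w)$ if, and only if, $P_i > 0$, or,
equivalently,

$$
\Gamma_i \ \text{and} \ \begin{bmatrix} -\gamma^{2} I & 0 \\ 0 & 1 \end{bmatrix} \ \text{have the same inertia.} \qquad (45.15)
$$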
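
To make the inertia condition (45.15) concrete, the sketch below checks it in the simplest nontrivial setting, under assumptions that are not part of the derivation above: real-valued data and the choice $L_i = u_i$ mentioned earlier, so that $\Gamma_i$ is a $2 \times 2$ symmetric matrix that depends on the data only through the scalar $a = u_i P_{i-1} u_i^{*}$. The helper names (`inertia_2x2`, `gamma_matrix`, `condition_45_15`) are hypothetical. For a $2 \times 2$ symmetric matrix the inertia can be read from the signs of the determinant and the trace.

```python
def inertia_2x2(m11, m12, m22):
    """(n_pos, n_neg, n_zero) eigenvalue counts of the symmetric matrix [[m11, m12], [m12, m22]]."""
    det = m11 * m22 - m12 * m12      # product of the two eigenvalues
    tr = m11 + m22                   # sum of the two eigenvalues
    if det < 0:
        return (1, 1, 0)
    if det > 0:
        return (2, 0, 0) if tr > 0 else (0, 2, 0)
    if tr > 0:
        return (1, 0, 1)
    if tr < 0:
        return (0, 1, 1)
    return (0, 0, 2)


def gamma_matrix(a, gamma):
    """Entries (g11, g12, g22) of Gamma_i for L_i = u_i, with a = u_i P_{i-1} u_i^T."""
    return (-gamma ** 2 + a, a, 1.0 + a)


def condition_45_15(a, gamma):
    """True when Gamma_i and diag{-gamma^2, 1} have the same inertia."""
    return inertia_2x2(*gamma_matrix(a, gamma)) == inertia_2x2(-gamma ** 2, 0.0, 1.0)


print(condition_45_15(2.0, 1.0), condition_45_15(2.0, 0.5))
```

In this special case the determinant of $\Gamma_i$ equals $a(1 - \gamma^2) - \gamma^2$, so the condition always holds for $\gamma \geq 1$ and can fail for $\gamma < 1$ once $a$ is large enough.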


Enforcing Positivity 
We still need to show how to determine estimates $\{\hat{s}_{j|j}\}$ such that the values of the successive
$J_i(w)$ are positive at their minima. We pursue the construction of the $\{\hat{s}_{j|j}\}$ by
induction. So assume that estimates $\{\hat{s}_{0|0}, \ldots, \hat{s}_{i-1|i-1}\}$ have been chosen such that the
values of the corresponding cost functions $\{J_0, J_1, \ldots, J_{i-1}\}$ are positive at their respective
minima:

$$
J_0(w_0) > 0, \quad J_1(w_1) > 0, \quad \ldots, \quad J_{i-1}(w_{i-1}) > 0
$$

Using the result of Lemma 44.1, we know that

$$
J_i(w_i) = J_{i-1}(w_{i-1}) + e_i^{*}\Gamma_i^{-1}e_i
$$

where, in view of the identifications (45.10),

$$
e_i = \begin{bmatrix} \hat{s}_{i|i} \\ d(i) \end{bmatrix} - \begin{bmatrix} L_i \\ u_i \end{bmatrix} w_{i-1}
$$

But since $J_{i-1}(w_{i-1}) > 0$, we find that in order to guarantee $J_i(w_i) > 0$ it is sufficient to
choose $\hat{s}_{i|i}$ such that the term below is positive,

$$
e_i^{*}\Gamma_i^{-1}e_i > 0 \qquad (45.16)
$$
This construction provides one possibility for enforcing $J_i(w_i) > 0$, and it will lead to
what is known as the central solution. Other constructions for $\hat{s}_{i|i}$ are possible but are less
immediate. We now explain how (45.16) can be achieved. For this purpose, we partition
the vector $e_i$ as

$$
e_i = \begin{bmatrix} \hat{s}_{i|i} \\ d(i) \end{bmatrix} - \begin{bmatrix} L_i \\ u_i \end{bmatrix} w_{i-1} \triangleq \begin{bmatrix} e_{s,i} \\ e(i) \end{bmatrix}
$$

where its bottom entry is a scalar, i.e.,

$$
e(i) \triangleq d(i) - u_i w_{i-1}
$$

and its top entry is dependent on $\hat{s}_{i|i}$, i.e.,

$$
e_{s,i} = \hat{s}_{i|i} - L_i w_{i-1}
$$

Using the defining expression (45.11) for $\Gamma_i$, condition (45.16) is equivalent to choosing
$\hat{s}_{i|i}$ such that

$$
\begin{bmatrix} e_{s,i}^{*} & e^{*}(i) \end{bmatrix}
\begin{bmatrix} -\gamma^{2} I + L_i P_{i-1} L_i^{*} & L_i P_{i-1} u_i^{*} \\ u_i P_{i-1} L_i^{*} & 1 + u_i P_{i-1} u_i^{*} \end{bmatrix}^{-1}
\begin{bmatrix} e_{s,i} \\ e(i) \end{bmatrix} > 0 \qquad (45.17)
$$

The expression on the left-hand side of (45.17) is quadratic in $\hat{s}_{i|i}$; it can be rewritten as
the sum of two squares by resorting to a completion-of-squares argument.


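Introduce the upper-diagonal-lower factorization of the inverse of the central matrix in
(45.17), namely,

$$
\begin{bmatrix} -\gamma^{2} I + L_i P_{i-1} L_i^{*} & L_i P_{i-1} u_i^{*} \\ u_i P_{i-1} L_i^{*} & 1 + u_i P_{i-1} u_i^{*} \end{bmatrix}
=
\begin{bmatrix} I & \dfrac{L_i P_{i-1} u_i^{*}}{1 + u_i P_{i-1} u_i^{*}} \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} \Delta & 0 \\ 0 & 1 + u_i P_{i-1} u_i^{*} \end{bmatrix}
\begin{bmatrix} I & 0 \\ \dfrac{u_i P_{i-1} L_i^{*}}{1 + u_i P_{i-1} u_i^{*}} & 1 \end{bmatrix} \qquad (45.18)
$$

where

$$
\Delta \triangleq (-\gamma^{2} I + L_i P_{i-1} L_i^{*}) - L_i P_{i-1} u_i^{*} (1 + u_i P_{i-1} u_i^{*})^{-1} u_i P_{i-1} L_i^{*} \qquad (45.19)
$$

Although unnecessary for our present argument, the matrix $\Delta$ can be shown to be negative-definite
(see Prob. XI.2). Inverting both sides of (45.18) we obtain a similar, albeit lower-diagonal-upper,
factorization for the center matrix in (45.17),

$$
\Gamma_i^{-1} =
\begin{bmatrix} I & 0 \\ -\dfrac{u_i P_{i-1} L_i^{*}}{1 + u_i P_{i-1} u_i^{*}} & 1 \end{bmatrix}
\begin{bmatrix} \Delta^{-1} & 0 \\ 0 & (1 + u_i P_{i-1} u_i^{*})^{-1} \end{bmatrix}
\begin{bmatrix} I & -\dfrac{L_i P_{i-1} u_i^{*}}{1 + u_i P_{i-1} u_i^{*}} \\ 0 & 1 \end{bmatrix} \qquad (45.20)
$$

Substituting into (45.17) leads to

$$
\begin{bmatrix} \left( e_{s,i} - \dfrac{L_i P_{i-1} u_i^{*}}{1 + u_i P_{i-1} u_i^{*}}\, e(i) \right)^{*} & e^{*}(i) \end{bmatrix}
\begin{bmatrix} \Delta^{-1} & 0 \\ 0 & (1 + u_i P_{i-1} u_i^{*})^{-1} \end{bmatrix}
\begin{bmatrix} e_{s,i} - \dfrac{L_i P_{i-1} u_i^{*}}{1 + u_i P_{i-1} u_i^{*}}\, e(i) \\ e(i) \end{bmatrix} > 0 \qquad (45.21)
$$

Now since

$$
(1 + u_i P_{i-1} u_i^{*}) > 0
$$

(in view of $P_{i-1} > 0$), the positivity condition can be met by setting

$$
e_{s,i} - \frac{L_i P_{i-1} u_i^{*}}{1 + u_i P_{i-1} u_i^{*}}\, e(i) = 0 \qquad (45.22)
$$

or, equivalently, using $e_{s,i} = \hat{s}_{i|i} - L_i w_{i-1}$,

$$
\hat{s}_{i|i} = L_i w_{i-1} + \frac{L_i P_{i-1} u_i^{*}}{1 + u_i P_{i-1} u_i^{*}} \left[ d(i) - u_i w_{i-1} \right]
= L_i \left[ w_{i-1} + \frac{P_{i-1} u_i^{*}}{1 + u_i P_{i-1} u_i^{*}} \left( d(i) - u_i w_{i-1} \right) \right]
$$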

This choice for $\hat{s}_{i|i}$ allows us to simplify the recursion for $w_i$ in (45.13). Indeed, substituting
the expression for $e_{s,i}$ from (45.22) into (45.13) leads to

$$
w_i = w_{i-1} + G_i \begin{bmatrix} \dfrac{L_i P_{i-1} u_i^{*}}{1 + u_i P_{i-1} u_i^{*}} \\ 1 \end{bmatrix} e(i)
$$

Using the factorization (45.20) for $\Gamma_i^{-1}$ in the expression (45.12) for $G_i$ gives

$$
G_i \begin{bmatrix} \dfrac{L_i P_{i-1} u_i^{*}}{1 + u_i P_{i-1} u_i^{*}} \\ 1 \end{bmatrix} = \frac{P_{i-1} u_i^{*}}{1 + u_i P_{i-1} u_i^{*}}
$$

so that the recursion for $w_i$ reduces to the form shown in the statement below. Comparing
with the above expression for $\hat{s}_{i|i}$ we conclude that $\hat{s}_{i|i}$ can be written more compactly as
$\hat{s}_{i|i} = L_i w_i$.


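Algorithm 45.1 (A posteriori robust filtering) Consider data $\{d(i), u_i\}$ that
satisfy the model $d(i) = u_i w^o + v(i)$, for some unknown vector $w^o$ and finite
energy noise sequence $\{v(i)\}$. Consider further a positive scalar $\gamma$ and a
positive-definite matrix $\Pi$. Let $s_i = L_i w^o$ denote some linear transformation
of $w^o$ that we wish to estimate causally from the $\{d(j)\}$ such that

$$
\frac{\sum_{j=0}^{i} \|\hat{s}_{j|j} - s_j\|^2}{w^{o*}\Pi w^o + \sum_{j=0}^{i} |v(j)|^2} < \gamma^2 \quad \text{for all } i = 0, 1, \ldots, N \qquad (45.23)
$$

where $\hat{s}_{j|j}$ denotes an estimate for $s_j$ using the data $\{d(0), d(1), \ldots, d(j)\}$.
One construction for the desired estimates can be computed as follows. Start
with $w_{-1} = 0$ and $P_{-1} = \Pi^{-1}$ and iterate

$$
w_i = w_{i-1} + \frac{P_{i-1} u_i^{*}}{1 + u_i P_{i-1} u_i^{*}} \left[ d(i) - u_i w_{i-1} \right]
$$

$$
\hat{s}_{i|i} = L_i w_i
$$

$$
\Gamma_i = \begin{bmatrix} -\gamma^{2} I & 0 \\ 0 & 1 \end{bmatrix} + \begin{bmatrix} L_i \\ u_i \end{bmatrix} P_{i-1} \begin{bmatrix} L_i^{*} & u_i^{*} \end{bmatrix}
$$

$$
G_i = P_{i-1} \begin{bmatrix} L_i^{*} & u_i^{*} \end{bmatrix} \Gamma_i^{-1}
$$

$$
P_i = P_{i-1} - G_i \Gamma_i G_i^{*}
$$

This solution satisfies (45.23) if, and only if, the $\{P_i\}$ are positive-definite,
or, equivalently, if and only if, $\Gamma_i$ and $\mathrm{diag}\{-\gamma^{2} I, 1\}$ have the same inertia
for $0 \leq i \leq N$.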


It turns out that some of the adaptive algorithms that we encountered earlier in Chapter 10,
and which were motivated there as stochastic-gradient methods, can now be re-examined
in light of the robust solution of Alg. 45.1 with some useful conclusions.


45.2 ε-NLMS ALGORITHM


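Let $L_i = u_i$ so that $s_i = u_i w^o$, i.e., $s_i$ is now a scalar and we rewrite it as $s(i)$. This choice
of $L_i$ corresponds to a situation in which we are interested in estimating the undisturbed
part of $d(i)$. In this case, the expressions for $\Gamma_i$ and $G_i$ from Alg. 45.1 become

$$
\Gamma_i = \begin{bmatrix} -\gamma^{2} & 0 \\ 0 & 1 \end{bmatrix} + \begin{bmatrix} u_i \\ u_i \end{bmatrix} P_{i-1} \begin{bmatrix} u_i^{*} & u_i^{*} \end{bmatrix}, \qquad
G_i = P_{i-1} \begin{bmatrix} u_i^{*} & u_i^{*} \end{bmatrix} \Gamma_i^{-1}
$$

so that the recursion for $P_i$ becomes

$$
P_i = P_{i-1} - P_{i-1} \begin{bmatrix} u_i^{*} & u_i^{*} \end{bmatrix}
\left( \begin{bmatrix} -\gamma^{2} & 0 \\ 0 & 1 \end{bmatrix} + \begin{bmatrix} u_i \\ u_i \end{bmatrix} P_{i-1} \begin{bmatrix} u_i^{*} & u_i^{*} \end{bmatrix} \right)^{-1}
\begin{bmatrix} u_i \\ u_i \end{bmatrix} P_{i-1}
$$

This recursion can be simplified. Inverting both sides and using the matrix inversion lemma
(44.21) we find that

$$
P_i^{-1} = P_{i-1}^{-1} + \begin{bmatrix} u_i^{*} & u_i^{*} \end{bmatrix} \begin{bmatrix} -\gamma^{-2} & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} u_i \\ u_i \end{bmatrix}
$$

or, equivalently,

$$
P_i^{-1} = P_{i-1}^{-1} + (1 - \gamma^{-2})\, u_i^{*} u_i \qquad (45.24)
$$

Inverting both sides of (45.24) allows us to re-express the recursion for $P_i$, when $\gamma \neq 1$, in the form

$$
P_i = P_{i-1} - \frac{P_{i-1} u_i^{*} u_i P_{i-1}}{(1 - \gamma^{-2})^{-1} + u_i P_{i-1} u_i^{*}}
$$

Observe from (45.24) that starting from $P_{-1}^{-1} = \Pi > 0$, all successive $P_i^{-1}$ will be positive-definite
for any $\gamma \geq 1$. In other words, the robust filtering problem that corresponds to the
choice $L_i = u_i$ is guaranteed to have a solution for any $\gamma \geq 1$. But can it have a solution
for $\gamma < 1$? The answer is in general negative.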

To see this, assume that the regressors $\{u_i\}$ are sufficiently exciting, i.e.,

$$
\lim_{N \to \infty} \left( \sum_{i=0}^{N} \|u_i\|^2 \right) = \infty \qquad (45.25)
$$

This is a mild condition on the regressors since it amounts to requiring them not to "vanish
quickly". Now iterating (45.24) gives

$$
P_i^{-1} = \Pi + (1 - \gamma^{-2}) \sum_{k=0}^{i} u_k^{*} u_k \qquad (45.26)
$$

For any $\gamma < 1$, and because of (45.25), it is easy to conclude that for sufficiently large $i$, at
least one diagonal element of the matrix

$$
\Pi + (1 - \gamma^{-2}) \sum_{k=0}^{i} u_k^{*} u_k
$$

must become negative. This fact violates the required positive-definiteness of $P_i$ so that
$\gamma < 1$ is not possible when the regressors are sufficiently exciting. In Prob. XI.4 it is shown
that, when the regressors do not satisfy (45.25), choices of $\gamma < 1$ are possible.

Assume now we set $\gamma = 1$ and choose $\Pi = \epsilon I$, for some small $\epsilon > 0$. Then recursion
(45.24) implies that

$$
P_i^{-1} = P_{i-1}^{-1} \quad \text{for all } i
$$

and, hence, $P_i = \epsilon^{-1} I$. In this case, the recursions of Alg. 45.1 collapse to the ε-NLMS
algorithm with unit step-size ($\mu = 1$; see Alg. 11.1).
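
The behavior of the recursion for $P_i$ discussed above can be checked numerically. The sketch below is illustrative only and rests on assumptions not made in the text: scalar real regressors (M = 1), the choice $L_i = u_i$, and hypothetical helper names `p_update` and `first_failure`. It updates $P_i$ through $\Gamma_i$ and $G_i$ as in Alg. 45.1, compares $P_i^{-1}$ against the accumulated form $\Pi + (1 - \gamma^{-2}) \sum_k u_k^2$, and reports the first time instant at which positivity is lost.

```python
def p_update(p_prev, u, gamma):
    """P_i from P_{i-1} via Gamma_i and G_i of Alg. 45.1 (scalar real data, L_i = u_i)."""
    a = u * p_prev * u
    g11, g12, g22 = -gamma ** 2 + a, a, 1.0 + a      # entries of the 2 x 2 matrix Gamma_i
    det = g11 * g22 - g12 * g12
    inv_sum = (g11 - 2.0 * g12 + g22) / det          # [1 1] Gamma_i^{-1} [1 1]^T
    # G_i Gamma_i G_i^* = (P u) [1 1] Gamma_i^{-1} [1 1]^T (u P)
    return p_prev - (p_prev * u) * inv_sum * (u * p_prev)


def first_failure(pi0, u_seq, gamma, tol=1e-9):
    """Index of the first non-positive P_i, or None if all P_i stay positive."""
    p, energy = 1.0 / pi0, 0.0                       # P_{-1} = Pi^{-1}
    for i, u in enumerate(u_seq):
        p = p_update(p, u, gamma)
        energy += u * u
        closed_form = pi0 + (1.0 - gamma ** -2) * energy
        # consistency check: 1/P_i should match the accumulated expression
        assert abs(1.0 / p - closed_form) < tol
        if p <= 0:
            return i
    return None


print(first_failure(1.0, [1.0] * 10, 0.9))   # gamma < 1 with persistent regressors
print(first_failure(1.0, [1.0] * 10, 1.0))   # gamma = 1: P_i stays constant
```

With $\gamma = 1$ the correction term vanishes and $P_i$ never changes, which is the mechanism behind the collapse to ε-NLMS; with $\gamma < 1$ and regressors that keep accumulating energy, $P_i^{-1}$ eventually crosses zero.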


Algorithm 45.2 (ε-NLMS algorithm) Consider data $\{d(i), u_i\}$ that satisfy the
model $d(i) = u_i w^o + v(i)$, for some unknown vector $w^o$ and finite energy noise
sequence $\{v(i)\}$. Let $s(i) = u_i w^o$ denote the uncorrupted part of $d(i)$ and
assume we wish to estimate $s(i)$ causally from the $\{d(j)\}$ such that

$$
\frac{\sum_{j=0}^{i} |\hat{s}(j|j) - s(j)|^2}{\epsilon \|w^o\|^2 + \sum_{j=0}^{i} |v(j)|^2} \leq 1 \quad \text{for all } i = 0, 1, \ldots, N
$$

where $\hat{s}(j|j)$ denotes an estimate for $s(j)$ using the data $\{d(0), d(1), \ldots, d(j)\}$.
One construction for the desired estimates can be computed as follows. Start
with $w_{-1} = 0$ and iterate

$$
w_i = w_{i-1} + \frac{u_i^{*}}{\epsilon + \|u_i\|^2} \left[ d(i) - u_i w_{i-1} \right]
$$

$$
\hat{s}(i|i) = u_i w_i
$$

When the regressors $\{u_i\}$ are sufficiently exciting, as in (45.25), then it can be verified
that ε-NLMS, with unit step-size, is a solution to the following min-max problem over
all finite-energy noise sequences (see Prob. XI.5):

$$
\gamma_{\rm opt}^2 = \inf_{\{\hat{s}(j|j)\}} \ \sup_{\{w^o,\, v(\cdot)\}}
\left( \frac{\sum_{j=0}^{\infty} |\hat{s}(j|j) - u_j w^o|^2}{\epsilon \|w^o\|^2 + \sum_{j=0}^{\infty} |v(j)|^2} \right) \qquad (45.27)
$$

In other words, the recursions of the algorithm minimize, through the choice of the estimates
$\{\hat{s}(j|j)\}$, the largest possible value of the energy gain from the disturbance sequence
$\{w^o, v(j)\}$ to the error sequence $\{\tilde{s}(j|j)\}$ and, moreover, $\gamma_{\rm opt}^2 = 1$.

We therefore find that ε-NLMS, which was derived in Chapter 10 by appealing to
stochastic approximations, is in fact an optimal recursive solution to the optimization problem
(45.27), in much the same way as RLS is the optimal recursive solution to the regularized
least-squares problem.


45.3 A PRIORI-BASED ROBUST FILTERS 


In the robust formulation of Sec. 45.1, the estimate $\hat{s}_{i|i}$ was required to depend on measurements
$\{d(j)\}$ up to and including time $i$. In this section, we shall show how to solve a
similar problem where the estimate of $s_i = L_i w^o$ is now required to depend on the $\{d(j)\}$
in a strictly causal manner. Specifically, we shall show how to determine an estimate $\hat{s}_{i|i-1}$
that is based on the data $\{d(j)\}$ up to and including time $i - 1$; hence, the notation $\hat{s}_{i|i-1}$
as opposed to $\hat{s}_{i|i}$.

Let again $\{v(i)\}$ denote any disturbance sequence with finite energy as in (45.2). We
wish to determine estimates

$$
\{\hat{s}_{0|-1}, \hat{s}_{1|0}, \hat{s}_{2|1}, \ldots, \hat{s}_{N|N-1}\}
$$

at times $i = 0, 1, \ldots, N$ such that, regardless of the disturbance sequence $\{v(i)\}$ and the
unknown $w^o$, the estimates should satisfy the performance measure:

$$
\frac{\sum_{j=0}^{i} \|\hat{s}_{j|j-1} - s_j\|^2}{w^{o*}\Pi w^o + \sum_{j=0}^{i-1} |v(j)|^2} < \gamma^2 \quad \text{for all } i = 0, 1, \ldots, N \qquad (45.28)
$$

Comparing with (45.5), we observe that the upper limit on the sum involving $v(j)$ is now
$i - 1$ rather than $i$. The condition (45.28) can be equivalently stated as requiring for $i = 0, 1, \ldots, N$,

$$
w^{o*}\Pi w^o + \sum_{j=0}^{i-1} |v(j)|^2 - \gamma^{-2} \sum_{j=0}^{i} \|\hat{s}_{j|j-1} - s_j\|^2 > 0 \qquad (45.29)
$$


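or, equivalently, by using the relation $d(j) = u_j w^o + v(j)$,

$$
w^{o*}\Pi w^o - \gamma^{-2} \|\hat{s}_{i|i-1} - s_i\|^2 + \sum_{j=0}^{i-1}
\left( \begin{bmatrix} \hat{s}_{j|j-1} \\ d(j) \end{bmatrix} - \begin{bmatrix} L_j \\ u_j \end{bmatrix} w^o \right)^{*}
\begin{bmatrix} -\gamma^{-2} I & 0 \\ 0 & 1 \end{bmatrix}
\left( \begin{bmatrix} \hat{s}_{j|j-1} \\ d(j) \end{bmatrix} - \begin{bmatrix} L_j \\ u_j \end{bmatrix} w^o \right) > 0 \qquad (45.30)
$$

no matter what $w^o$ is. In other words, the problem of determining estimates $\{\hat{s}_{j|j-1}\}$ in
order to satisfy (45.28), for any $w^o$ and for any disturbance sequence $\{v(j)\}$ satisfying
(45.2), is equivalent to the problem of determining the $\{\hat{s}_{j|j-1}\}$ in order to satisfy (45.30)
for any $w^o$.

Observe that the index for the sum involving $\hat{s}_{j|j-1}$ in (45.29) and (45.30) runs from 0
up to $i$, while the index for the sum involving $v(j)$ (and, consequently, $d(j)$) runs from 0
to $i - 1$. In order to treat both sums uniformly, with the indices running from 0 to $i$ in both
cases, we define the weighting matrix

$$
W_i \triangleq \begin{bmatrix} -\gamma^{-2} I & 0 \\ 0 & 1 \end{bmatrix} \oplus \cdots \oplus \begin{bmatrix} -\gamma^{-2} I & 0 \\ 0 & 1 \end{bmatrix} \oplus \begin{bmatrix} -\gamma^{-2} I & 0 \\ 0 & 0 \end{bmatrix} \qquad (45.31)
$$

with a last block entry that is singular. Likewise, we define the vector and matrix quantities

$$
y_i \triangleq \begin{bmatrix} \hat{s}_{0|-1} \\ d(0) \\ \hat{s}_{1|0} \\ d(1) \\ \vdots \\ \hat{s}_{i|i-1} \\ d(i) \end{bmatrix}, \qquad
H_i \triangleq \begin{bmatrix} L_0 \\ u_0 \\ L_1 \\ u_1 \\ \vdots \\ L_i \\ u_i \end{bmatrix} \qquad (45.32)
$$

The problem then amounts to determining estimates $\{\hat{s}_{j|j-1}\}$ that guarantee
$J_i(w) > 0$ for $i = 0, 1, \ldots, N$ and for any $w$, where the cost function $J_i(w)$ is defined by

$$
J_i(w) = w^{*}\Pi w + (y_i - H_i w)^{*} W_i (y_i - H_i w)
$$

and where we are now denoting the indeterminate variable more generally by $w$.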

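Since the last block entry of the weighting matrix $W_i$ is singular, while all other blocks
in it are invertible, we are therefore faced with the situation discussed in Sec. 44.4. Using
the results of that section, with $i$ playing the role of $N$, we find that the minimizing
argument $w_i$ of $J_i(w)$ can be determined recursively as follows. Start with $w_{-1} = 0$ and
$P_{-1} = \Pi^{-1}$ and iterate for $0 \leq j \leq i - 1$:

$$
\Gamma_j = \begin{bmatrix} -\gamma^{2} I & 0 \\ 0 & 1 \end{bmatrix} + \begin{bmatrix} L_j \\ u_j \end{bmatrix} P_{j-1} \begin{bmatrix} L_j^{*} & u_j^{*} \end{bmatrix} \qquad (45.33)
$$

$$
G_j = P_{j-1} \begin{bmatrix} L_j^{*} & u_j^{*} \end{bmatrix} \Gamma_j^{-1} \qquad (45.34)
$$

$$
w_j = w_{j-1} + G_j \left( \begin{bmatrix} \hat{s}_{j|j-1} \\ d(j) \end{bmatrix} - \begin{bmatrix} L_j \\ u_j \end{bmatrix} w_{j-1} \right) \qquad (45.35)
$$

$$
P_j = P_{j-1} - G_j \Gamma_j G_j^{*} \qquad (45.36)
$$

At iteration $i$, the vector $w_i$ is found via

$$
\bar{\Gamma}_i = -\gamma^{2} I + L_i P_{i-1} L_i^{*} \qquad (45.37)
$$

$$
\bar{G}_i = P_{i-1} L_i^{*} \bar{\Gamma}_i^{-1} \qquad (45.38)
$$

$$
w_i = w_{i-1} + \bar{G}_i (\hat{s}_{i|i-1} - L_i w_{i-1}) \qquad (45.39)
$$

$$
P_i = P_{i-1} - \bar{G}_i \bar{\Gamma}_i \bar{G}_i^{*} \qquad (45.40)
$$

The resulting $w_i$ minimizes $J_i(w)$ if, and only if,

$$
P_{i-1}^{-1} - \gamma^{-2} L_i^{*} L_i > 0 \qquad (45.41)
$$

Moreover, the resulting minimum cost at time $i$ is given by

$$
J_i(w_i) = \sum_{j=0}^{i-1} e_j^{*} \Gamma_j^{-1} e_j + \bar{e}_i^{*} \bar{\Gamma}_i^{-1} \bar{e}_i
$$

where

$$
e_j = \begin{bmatrix} \hat{s}_{j|j-1} \\ d(j) \end{bmatrix} - \begin{bmatrix} L_j \\ u_j \end{bmatrix} w_{j-1} \triangleq \begin{bmatrix} e_{s,j} \\ e(j) \end{bmatrix} \quad \text{for } 0 \leq j \leq i - 1, \qquad
\bar{e}_i = \hat{s}_{i|i-1} - L_i w_{i-1} = e_{s,i}
$$

In other words, by using the expressions for $\{\Gamma_j, \bar{\Gamma}_i\}$,

$$
J_i(w_i) = \sum_{j=0}^{i-1} \begin{bmatrix} e_{s,j}^{*} & e^{*}(j) \end{bmatrix}
\left( \begin{bmatrix} -\gamma^{2} I & 0 \\ 0 & 1 \end{bmatrix} + \begin{bmatrix} L_j \\ u_j \end{bmatrix} P_{j-1} \begin{bmatrix} L_j^{*} & u_j^{*} \end{bmatrix} \right)^{-1}
\begin{bmatrix} e_{s,j} \\ e(j) \end{bmatrix}
+ e_{s,i}^{*} (-\gamma^{2} I + L_i P_{i-1} L_i^{*})^{-1} e_{s,i}
$$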

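This expression for the minimum cost can be simplified as follows. Introduce the upper-diagonal-lower
factorization for $\Gamma_j^{-1}$:

$$
\begin{bmatrix} -\gamma^{2} I + L_j P_{j-1} L_j^{*} & L_j P_{j-1} u_j^{*} \\ u_j P_{j-1} L_j^{*} & 1 + u_j P_{j-1} u_j^{*} \end{bmatrix}^{-1}
=
\begin{bmatrix} I & -(-\gamma^{2} I + L_j P_{j-1} L_j^{*})^{-1} L_j P_{j-1} u_j^{*} \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} (-\gamma^{2} I + L_j P_{j-1} L_j^{*})^{-1} & 0 \\ 0 & \Delta_j^{-1} \end{bmatrix}
\begin{bmatrix} I & 0 \\ -u_j P_{j-1} L_j^{*} (-\gamma^{2} I + L_j P_{j-1} L_j^{*})^{-1} & 1 \end{bmatrix} \qquad (45.42)
$$

where

$$
\begin{aligned}
\Delta_j &\triangleq 1 + u_j P_{j-1} u_j^{*} - u_j P_{j-1} L_j^{*} (-\gamma^{2} I + L_j P_{j-1} L_j^{*})^{-1} L_j P_{j-1} u_j^{*} \\
&= 1 + u_j \left[ P_{j-1} - P_{j-1} L_j^{*} (-\gamma^{2} I + L_j P_{j-1} L_j^{*})^{-1} L_j P_{j-1} \right] u_j^{*} \\
&= 1 + u_j (P_{j-1}^{-1} - \gamma^{-2} L_j^{*} L_j)^{-1} u_j^{*} \\
&\triangleq 1 + u_j \tilde{P}_j u_j^{*}
\end{aligned}
$$

and

$$
\tilde{P}_j^{-1} \triangleq P_{j-1}^{-1} - \gamma^{-2} L_j^{*} L_j \qquad (45.43)
$$

We know from condition (45.41) that $\tilde{P}_j$ is positive-definite, so that $\Delta_j$ itself is positive.
Then we can write

$$
J_i(w_i) = \sum_{j=0}^{i} e_{s,j}^{*} (-\gamma^{2} I + L_j P_{j-1} L_j^{*})^{-1} e_{s,j}
+ \sum_{j=0}^{i-1} (d(j) - \hat{d}(j))^{*} \Delta_j^{-1} (d(j) - \hat{d}(j))
$$

where we introduced

$$
\hat{d}(j) \triangleq u_j w_{j-1} + u_j P_{j-1} L_j^{*} (-\gamma^{2} I + L_j P_{j-1} L_j^{*})^{-1} e_{s,j}
$$

It is now clear that $J_i(w_i)$ can be made positive by choosing $e_{s,j} = 0$ or, equivalently,

$$
\hat{s}_{j|j-1} = L_j w_{j-1}
$$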

Substituting into recursion (45.35) for $w_j$ we get

$$
w_j = w_{j-1} + G_j \begin{bmatrix} 0 \\ e(j) \end{bmatrix}
= w_{j-1} + \frac{\tilde{P}_j u_j^{*}}{1 + u_j \tilde{P}_j u_j^{*}}\, e(j)
$$

where the second equality follows by using the factorization (45.42) for $\Gamma_j^{-1}$. On the other
hand, from (45.39) we get $w_i = w_{i-1}$. We therefore arrive at the following statement.



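Algorithm 45.3 (A priori robust filtering) Consider data $\{d(i), u_i\}$ that satisfy
the model $d(i) = u_i w^o + v(i)$, for some unknown vector $w^o$ and finite
energy noise sequence $\{v(i)\}$. Consider further a positive scalar $\gamma$ and a
positive-definite matrix $\Pi$. Let $s_i = L_i w^o$ denote some linear transformation
of $w^o$ that we wish to estimate in a strictly causal manner from the $\{d(j)\}$
such that

$$
\frac{\sum_{j=0}^{i} \|\hat{s}_{j|j-1} - s_j\|^2}{w^{o*}\Pi w^o + \sum_{j=0}^{i-1} |v(j)|^2} < \gamma^2 \quad \text{for all } i = 0, 1, \ldots, N \qquad (45.44)
$$

where $\hat{s}_{j|j-1}$ denotes an estimate for $s_j$ using the data $\{d(0), d(1), \ldots, d(j-1)\}$.
One construction for the desired estimates can be computed as follows.
Start with $w_{-1} = 0$ and $P_{-1} = \Pi^{-1}$ and iterate for $0 \leq i < N$:

$$
\hat{s}_{i|i-1} = L_i w_{i-1}
$$

$$
\tilde{P}_i^{-1} = P_{i-1}^{-1} - \gamma^{-2} L_i^{*} L_i
$$

$$
w_i = w_{i-1} + \frac{\tilde{P}_i u_i^{*}}{1 + u_i \tilde{P}_i u_i^{*}} \left[ d(i) - u_i w_{i-1} \right]
$$

$$
\Gamma_i = \begin{bmatrix} -\gamma^{2} I & 0 \\ 0 & 1 \end{bmatrix} + \begin{bmatrix} L_i \\ u_i \end{bmatrix} P_{i-1} \begin{bmatrix} L_i^{*} & u_i^{*} \end{bmatrix}
$$

$$
G_i = P_{i-1} \begin{bmatrix} L_i^{*} & u_i^{*} \end{bmatrix} \Gamma_i^{-1}
$$

$$
P_i = P_{i-1} - G_i \Gamma_i G_i^{*}
$$

while $\hat{s}_{N|N-1} = L_N w_{N-1}$ and $w_N = w_{N-1}$. This solution satisfies (45.44)
if, and only if, the $\{\tilde{P}_i\}$ are positive-definite for $0 \leq i \leq N$.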


45.4 LMS ALGORITHM 


Consider again the choice $L_i = u_i$ so that $s_i = u_i w^o$, i.e., $s_i$ is now a scalar, $s(i)$. In
this case, we know from Sec. 45.2 that the recursion for $P_i$ reduces to (45.24). We can
then pose the same question with regards to the value of $\gamma$ in (45.44) when $L_i = u_i$. That
is, can it be smaller than one? As before, the answer is negative for sufficiently exciting
regressors as in (45.25). To see this, we use expression (45.24) for $P_{i-1}^{-1}$, iterated as in (45.26), along with

$$
\tilde{P}_i^{-1} = P_{i-1}^{-1} - \gamma^{-2} u_i^{*} u_i
$$

to write

$$
\tilde{P}_i^{-1} = \Pi + (1 - \gamma^{-2}) \sum_{k=0}^{i-1} u_k^{*} u_k - \gamma^{-2} u_i^{*} u_i \qquad (45.45)
$$

For any $\gamma < 1$, it is easy to conclude that for sufficiently large $i$, at least one diagonal
element of the matrix on the right-hand side of (45.45)
must become negative. This fact violates the required positive-definiteness of $\{\tilde{P}_i\}$ so
that $\gamma < 1$ is not possible when the regressors are sufficiently exciting. What about $\gamma = 1$?
If we set $\gamma = 1$ and choose $\Pi = \mu^{-1} I$, for some small $\mu > 0$, then it follows from (45.24)
that

$$
P_i = \mu I
$$

Moreover, the following equalities hold:

$$
\tilde{P}_i u_i^{*} (1 + u_i \tilde{P}_i u_i^{*})^{-1} = (\tilde{P}_i^{-1} + u_i^{*} u_i)^{-1} u_i^{*} = P_i u_i^{*} = \mu u_i^{*}
$$

so that the recursions of Alg. 45.3 collapse to the LMS algorithm with step-size $\mu$ (see
Alg. 10.1).

Moreover, when the regressors $\{u_i\}$ are sufficiently exciting, as in (45.25), it can be
verified that LMS provides a solution to the following min-max problem over all finite-energy
noise sequences (see Prob. XI.5):

$$
\gamma_{\rm opt}^2 = \inf_{\{\hat{s}(j|j-1)\}} \ \sup_{\{w^o,\, v(\cdot)\}}
\left( \frac{\sum_{j=0}^{\infty} |\hat{s}(j|j-1) - s(j)|^2}{\mu^{-1} \|w^o\|^2 + \sum_{j=0}^{\infty} |v(j)|^2} \right) \qquad (45.46)
$$

In other words, the recursions of the algorithm minimize, through the choice of the strictly
causal estimates $\{\hat{s}(j|j-1)\}$, the largest possible value of the energy gain from the disturbance
sequence $\{w^o, v(j)\}$ to the error sequence $\{\tilde{s}(j|j-1)\}$ and, moreover, $\gamma_{\rm opt}^2 = 1$.
We therefore find that LMS, which was derived in Sec. 10.2 by appealing to stochastic-gradient
approximations, is in fact an optimal recursive solution to the optimization problem
(45.46).


Algorithm 45.4 (LMS algorithm) Consider data $\{d(i), u_i\}$ that satisfy the
model $d(i) = u_i w^o + v(i)$, for some unknown vector $w^o$ and finite energy
noise sequence $\{v(i)\}$. Let $s(i) = u_i w^o$ denote the uncorrupted part of $d(i)$
and assume we wish to estimate $s(i)$ in a strictly causal manner from the
$\{d(j)\}$ such that

$$
\frac{\sum_{j=0}^{i} |\hat{s}(j|j-1) - s(j)|^2}{\mu^{-1} \|w^o\|^2 + \sum_{j=0}^{i-1} |v(j)|^2} \leq 1 \quad \text{for all } i = 0, 1, \ldots, N \qquad (45.47)
$$

where $\hat{s}(j|j-1)$ denotes an estimate for $s(j)$ using the data $\{d(0), \ldots, d(j-1)\}$.
One construction for the desired estimates can be computed as follows.
Start with $w_{-1} = 0$ and iterate for $0 \leq i < N$:

$$
\hat{s}(i|i-1) = u_i w_{i-1}
$$

$$
w_i = w_{i-1} + \mu u_i^{*} \left[ d(i) - u_i w_{i-1} \right]
$$

while $\hat{s}(N|N-1) = u_N w_{N-1}$. This solution satisfies the robustness condition
(45.47) if, and only if, the matrices $\{\mu^{-1} I - u_i^{*} u_i\}$ are positive-definite
for $0 \leq i \leq N$.

Observe that the existence condition in the statement of the algorithm is in terms of the
positive-definiteness of the matrices $\{\mu^{-1} I - u_i^{*} u_i\}$, each of which is a rank-one
modification of $\mu^{-1} I$. The eigenvalues of every
such matrix are given by $\{\mu^{-1}, \mu^{-1}, \ldots, \mu^{-1}, \mu^{-1} - \|u_i\|^2\}$, with $(M - 1)$ eigenvalues
at $\mu^{-1}$ and one eigenvalue at $\mu^{-1} - \|u_i\|^2$ (assuming regressors of size $1 \times M$). Therefore,
the matrices $\{\mu^{-1} I - u_i^{*} u_i\}$ will be positive-definite for $i = 0, 1, \ldots, N$ if the step-size $\mu$
is chosen to satisfy

$$
\sup_{i}\ \mu \|u_i\|^2 < 1 \qquad (45.48)
$$

In other words, the step-size $\mu$ needs to be sufficiently small in order for the LMS filter to
satisfy the robustness condition (45.47).
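
The step-size condition (45.48) and the bound (45.47) can be checked together on synthetic data. This is a sketch only, under assumptions that are not part of the statement above: real-valued data, Gaussian regressors and noise with a fixed seed, and a step-size set to 0.9 divided by the largest squared regressor norm in the record. The names `lms` and `a_priori_ratios` are hypothetical.

```python
import random


def lms(d, U, mu):
    """LMS for real data; returns the predicted estimates s_hat(i|i-1) = u_i w_{i-1}."""
    w = [0.0] * len(U[0])                                # w_{-1} = 0
    s_pred = []
    for d_i, u in zip(d, U):
        s_i = sum(uk * wk for uk, wk in zip(u, w))       # uses data up to time i-1 only
        s_pred.append(s_i)
        e = d_i - s_i
        w = [wk + mu * uk * e for wk, uk in zip(w, u)]
    return s_pred


def a_priori_ratios(s_pred, s, v, init_energy):
    """Ratio in (45.47): error energy over j = 0..i, noise energy over j = 0..i-1."""
    num, den, out = 0.0, init_energy, []
    for sp, sj, vj in zip(s_pred, s, v):
        num += (sp - sj) ** 2
        out.append(num / den)        # den holds the noise energy up to time i-1
        den += vj ** 2
    return out


rng = random.Random(1)               # fixed seed so the run is repeatable
M, N = 4, 200                        # illustrative sizes (assumptions)
w_o = [rng.uniform(-1.0, 1.0) for _ in range(M)]
U = [[rng.gauss(0.0, 1.0) for _ in range(M)] for _ in range(N)]
v = [rng.gauss(0.0, 0.5) for _ in range(N)]
s = [sum(uk * wk for uk, wk in zip(u, w_o)) for u in U]
d = [sj + vj for sj, vj in zip(s, v)]

max_norm2 = max(sum(uk * uk for uk in u) for u in U)
mu = 0.9 / max_norm2                 # enforces mu * ||u_i||^2 < 1 over this record
ratios = a_priori_ratios(lms(d, U, mu), s, v, sum(x * x for x in w_o) / mu)
print(mu * max_norm2, max(ratios))
```

Choosing the step-size from the largest regressor norm requires knowing the whole record in advance; in practice a bound on the regressor energy would play that role.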


45.A APPENDIX: $H^{\infty}$ FILTERS


In this appendix we explain the general form of $H^{\infty}$ filters for linear state-space estimation, and then
show how the adaptive robust filters derived in the body of the chapter can be obtained as special
cases (in much the same way as RLS itself was seen to be a special case of the Kalman filter in
App. 31.2). For more details, the reader is referred to the monograph by Hassibi, Sayed, and Kailath
(1999, Chapter 4) and to the article by Sayed, Hassibi, and Kailath (1996a); the inertia (existence)
conditions listed below follow the latter reference where they are derived in a manner similar to the
arguments used in App. 44.B.
Consider a state-space model of the form

$$
x_{i+1} = F_i x_i + G_i u_i, \qquad y_i = H_i x_i + v_i, \qquad i \geq 0 \qquad (45.49)
$$

with initial condition $x_0$, and where $\{F_i, G_i, H_i\}$ are known $n \times n$, $n \times m$, and $p \times n$ matrices,
respectively. Moreover, the $\{u_i, v_i\}$ denote disturbances, which along with $x_0$, are assumed to be
unknown. Using the observations $\{y_j\}$, we would like to estimate some linear combination of the
state entries, say, $s_i = L_i x_i$ for some given $q \times n$ matrices $L_i$. Let $\hat{s}_{i|i}$ denote an estimate of $s_i$ that
is based on the observations $\{y_j\}$ from time 0 up to and including time $i$. Likewise, let $\hat{s}_{i|i-1}$ denote
an estimate of $s_i$ that is based on the observations $\{y_j\}$ from time 0 up to time $i - 1$. In other words,
$\hat{s}_{i|i}$ depends on the observations in a causal manner, while $\hat{s}_{i|i-1}$ depends on the observations in a
strictly causal manner. We refer to $\hat{s}_{i|i}$ as a filtered estimate and to $\hat{s}_{i|i-1}$ as a predicted estimate.
Define further the filtered and predicted errors $\tilde{s}_{i|i} = s_i - \hat{s}_{i|i}$ and $\tilde{s}_{i|i-1} = s_i - \hat{s}_{i|i-1}$.


$H^{\infty}$ A Posteriori Filters

$H^{\infty}$ a posteriori filters are concerned with the computation of filtered estimates. In this context,
we want to determine estimates $\{\hat{s}_{0|0}, \hat{s}_{1|1}, \hat{s}_{2|2}, \ldots, \hat{s}_{N|N}\}$ at times $i = 0, 1, \ldots, N$ such that,
regardless of the disturbances $\{u_j, v_j\}$ and the unknown $x_0$, the estimates guarantee

$$
\frac{\sum_{j=0}^{i} \|\tilde{s}_{j|j}\|^2}{x_0^{*}\Pi_0^{-1}x_0 + \sum_{j=0}^{i} \|u_j\|^2 + \sum_{j=0}^{i} \|v_j\|^2} < \gamma_f^2 \quad \text{for } i = 0, 1, \ldots, N \qquad (45.50)
$$

for some given positive scalar $\gamma_f$ and positive-definite weighting matrix $\Pi_0$. This requirement has
the following interpretation. For every time instant $i$, the numerator is a measure of the estimation
error energy up to time $i$, while the denominator consists of two terms:


1. One term is the energy of the disturbances, $\{u_j, v_j\}$, over the same period of time.


2. The other term, $x_0^{*}\Pi_0^{-1}x_0$, is the weighted energy of the error in estimating $x_0$ by using, for
the lack of any other information, a zero initial guess for it.


We therefore say that the denominator in (45.50) is a measure of the energies of the disturbances in
the problem, while the numerator is a measure of the resulting estimation error energies. In this way,
criterion (45.50) requires that we determine estimates $\{\hat{s}_{j|j}\}$ such that the ratio of energies, from
disturbances to estimation errors, does not exceed $\gamma_f^2$ for any $\{v_j, u_j, x_0\}$. When this property holds,
we say that the resulting estimates are robust in the sense that bounded disturbance energies lead to
bounded estimation error energies.

It turns out that a recursive solution exists for state-space models of the form (45.49), in a manner
similar to the solution of Alg. 45.1.


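Algorithm 45.5 ($H^{\infty}$ a posteriori filter) One construction of an a posteriori $H^{\infty}$
filter is found as follows:

$$
\hat{x}_{i|i} = F_{i-1}\hat{x}_{i-1|i-1} + P_{i|i-1}H_i^{*}(I + H_i P_{i|i-1} H_i^{*})^{-1}(y_i - H_i F_{i-1}\hat{x}_{i-1|i-1})
$$

$$
\hat{s}_{i|i} = L_i \hat{x}_{i|i}
$$

$$
R_i = \begin{bmatrix} -\gamma_f^{2} I & 0 \\ 0 & I \end{bmatrix}, \qquad
R_{e,i} = R_i + \begin{bmatrix} L_i \\ H_i \end{bmatrix} P_{i|i-1} \begin{bmatrix} L_i^{*} & H_i^{*} \end{bmatrix}
$$

$$
P_{i+1|i} = F_i P_{i|i-1} F_i^{*} + G_i G_i^{*} - F_i P_{i|i-1} \begin{bmatrix} L_i^{*} & H_i^{*} \end{bmatrix} R_{e,i}^{-1} \begin{bmatrix} L_i \\ H_i \end{bmatrix} P_{i|i-1} F_i^{*}
$$

with initial conditions $\hat{x}_{-1|-1} = 0$ and $P_{0|-1} = \Pi_0$. This construction satisfies
(45.50) if, and only if, $R_i$ and $R_{e,i}$ have the same inertia for all $0 \leq i \leq N$ or, equivalently,
if and only if, $I + H_i P_{i|i-1} H_i^{*} > 0$ and $-\gamma_f^{2} I + L_i (P_{i|i-1}^{-1} + H_i^{*} H_i)^{-1} L_i^{*} < 0$
for $0 \leq i \leq N$.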


$H^{\infty}$ A Priori Filters


$H^{\infty}$ a priori filters are concerned with the computation of predicted estimates. In this context, we
want to determine estimates $\{\hat{s}_{0|-1}, \hat{s}_{1|0}, \hat{s}_{2|1}, \ldots, \hat{s}_{N|N-1}\}$ at times $i = 0, 1, \ldots, N$ such that,
regardless of the disturbances $\{u_j, v_j\}$ and the unknown $x_0$, the estimates guarantee

$$
\frac{\sum_{j=0}^{i} \|\tilde{s}_{j|j-1}\|^2}{x_0^{*}\Pi_0^{-1}x_0 + \sum_{j=0}^{i-1} \|u_j\|^2 + \sum_{j=0}^{i-1} \|v_j\|^2} < \gamma_p^2 \quad \text{for } i = 0, 1, \ldots, N \qquad (45.51)
$$

for some given positive scalar $\gamma_p$ and positive-definite weighting matrix $\Pi_0$. Compared with (45.50),
we see that the sums over $\|v_j\|^2$ and $\|u_j\|^2$ run only up to time instant $i - 1$ due to the strict causality
constraint. Again, it turns out that a recursive solution exists for state-space models of the form
(45.49), in a manner similar to the solution of Alg. 45.3.


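Algorithm 45.6 ($H^{\infty}$ a priori filter) One construction of an a priori $H^{\infty}$ filter is
as follows:

$$
\hat{s}_{i|i-1} = L_i \hat{x}_{i|i-1}, \qquad \tilde{P}_{i|i-1}^{-1} = P_{i|i-1}^{-1} - \gamma_p^{-2} L_i^{*} L_i
$$

$$
\hat{x}_{i+1|i} = F_i \hat{x}_{i|i-1} + F_i \tilde{P}_{i|i-1} H_i^{*} (I + H_i \tilde{P}_{i|i-1} H_i^{*})^{-1} (y_i - H_i \hat{x}_{i|i-1})
$$

$$
R_i = \begin{bmatrix} -\gamma_p^{2} I & 0 \\ 0 & I \end{bmatrix}, \qquad
R_{e,i} = R_i + \begin{bmatrix} L_i \\ H_i \end{bmatrix} P_{i|i-1} \begin{bmatrix} L_i^{*} & H_i^{*} \end{bmatrix}
$$

$$
P_{i+1|i} = F_i P_{i|i-1} F_i^{*} + G_i G_i^{*} - F_i P_{i|i-1} \begin{bmatrix} L_i^{*} & H_i^{*} \end{bmatrix} R_{e,i}^{-1} \begin{bmatrix} L_i \\ H_i \end{bmatrix} P_{i|i-1} F_i^{*}
$$

with initial conditions $\hat{x}_{0|-1} = 0$ and $P_{0|-1} = \Pi_0$. This construction satisfies
(45.51) if, and only if, $\tilde{P}_{i|i-1} > 0$ for all $0 \leq i \leq N$.


Special Case of Adaptive Filters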

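It is now straightforward to verify that the a posteriori and a priori $H^{\infty}$ filters of Algs. 45.5 and 45.6
reduce to the robust adaptive filters of Algs. 45.1 and 45.3 when the state-space model (45.49) is
specialized to $x_{i+1} = x_i$ and $y(i) = u_i x_i + v(i)$. That is, when $F_i$ is set equal to the identity
matrix and $H_i$ is set equal to a row vector $u_i$. Moreover, $y_i$ becomes a scalar $y(i)$. Substituting these
choices into the a posteriori $H^{\infty}$ filter of Alg. 45.5, we find that it collapses to

$$
\hat{x}_{i|i} = \hat{x}_{i-1|i-1} + \frac{P_{i|i-1} u_i^{*}}{1 + u_i P_{i|i-1} u_i^{*}} \left( y(i) - u_i \hat{x}_{i-1|i-1} \right)
$$

$$
\hat{s}_{i|i} = L_i \hat{x}_{i|i}
$$

$$
R_i = \begin{bmatrix} -\gamma_f^{2} I & 0 \\ 0 & 1 \end{bmatrix}, \qquad
R_{e,i} = R_i + \begin{bmatrix} L_i \\ u_i \end{bmatrix} P_{i|i-1} \begin{bmatrix} L_i^{*} & u_i^{*} \end{bmatrix}
$$

$$
P_{i+1|i} = P_{i|i-1} - P_{i|i-1} \begin{bmatrix} L_i^{*} & u_i^{*} \end{bmatrix} R_{e,i}^{-1} \begin{bmatrix} L_i \\ u_i \end{bmatrix} P_{i|i-1}
$$

Comparing with the statement of the a posteriori adaptive robust filter in Alg. 45.1, we find that the
equations are identical once the identifications shown in Table 45.1 below are made between the $H^{\infty}$
variables and the adaptive filter variables. A similar conclusion holds for the a priori $H^{\infty}$ filter. Its
equations reduce to

$$
\hat{s}_{i|i-1} = L_i \hat{x}_{i|i-1}
$$

$$
\tilde{P}_{i|i-1}^{-1} = P_{i|i-1}^{-1} - \gamma_p^{-2} L_i^{*} L_i
$$

$$
\hat{x}_{i+1|i} = \hat{x}_{i|i-1} + \frac{\tilde{P}_{i|i-1} u_i^{*}}{1 + u_i \tilde{P}_{i|i-1} u_i^{*}} \left( y(i) - u_i \hat{x}_{i|i-1} \right)
$$

$$
R_i = \begin{bmatrix} -\gamma_p^{2} I & 0 \\ 0 & 1 \end{bmatrix}, \qquad
R_{e,i} = R_i + \begin{bmatrix} L_i \\ u_i \end{bmatrix} P_{i|i-1} \begin{bmatrix} L_i^{*} & u_i^{*} \end{bmatrix}
$$

$$
P_{i+1|i} = P_{i|i-1} - P_{i|i-1} \begin{bmatrix} L_i^{*} & u_i^{*} \end{bmatrix} R_{e,i}^{-1} \begin{bmatrix} L_i \\ u_i \end{bmatrix} P_{i|i-1}
$$

which agree with the equations of Alg. 45.3.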


TABLE 45.1 Correspondences between $\mathcal{H}^\infty$ and robust adaptive variables.

| $\mathcal{H}^\infty$ variable | Adaptive variable | Description |
| $y(i)$ | $d(i)$ | Measurement |
| $x_i$ | $w^o$ | State vector (weight vector) |
| $\hat{x}_{i\mid i}$ | $w_i$ | State estimator (weight estimate, a posteriori filter) |
| $\hat{x}_{i+1\mid i}$ | $w_i$ | State estimator (weight estimate, a priori filter) |
| $P_{i+1\mid i}$ | $P_i$ | Riccati variable |
| $H_i$, $L_i$ | $u_i$ | Coefficient matrix |
| $\hat{x}_{0\mid -1}$ | $w_{-1}$ | Initial condition |
| $\Pi_0$ | $\Pi^{-1}$ | Initial covariance |


CHAPTER 46


Robustness Properties


This chapter deals with robustness analysis, as opposed to robust filter design. While the
presentation in Chapter 45 was concerned with a framework for designing robust filters and
applying it to the study of LMS and $\epsilon$-NLMS, the results obtained therein are not immediately
applicable to studying the robustness performance of other adaptive filters. To do so,
in this chapter we resort to the same energy-conservation arguments that we employed in
Parts IV (Mean-Square Performance) and V (Transient Performance) while studying the
performance of adaptive filters. As a byproduct, we shall gain further insights into the robustness
performance not only of LMS and $\epsilon$-NLMS, but of other adaptive filters as well.
Besides providing a more intuitive route to the robustness results of Chapter 45, the energy
arguments also lead to tighter robustness bounds. The discussion in this chapter is self-contained
and it approaches the subject of robustness from first principles; the presentation
does not rely on the indefinite least-squares theory developed in the previous two chapters.


46.1 ROBUSTNESS OF LMS 


Let us start with the LMS algorithm in order to illustrate the main ideas. Consider again
measurements $\{d(i)\}$ that arise from a model of the form

$$d(i) = u_iw^o + v(i) \tag{46.1}$$

for some unknown weight vector $w^o$ and unknown disturbance $v(i)$. The LMS algorithm
estimates $w^o$ recursively according to the rule:

$$w_i = w_{i-1} + \mu u_i^*[d(i) - u_iw_{i-1}], \qquad w_{-1} = \text{initial condition} \tag{46.2}$$

where $\mu$ is the step-size parameter. Introduce the error quantities:

$$\tilde{w}_i = w^o - w_i, \qquad e_a(i) = u_i\tilde{w}_{i-1}, \qquad e_p(i) = u_i\tilde{w}_i$$

They can be related as follows. Subtracting both sides of (46.2) from $w^o$ leads to

$$\tilde{w}_i = \tilde{w}_{i-1} - \mu u_i^*[e_a(i) + v(i)] \tag{46.3}$$

since

$$d(i) - u_iw_{i-1} = e_a(i) + v(i)$$


Squaring both sides of (46.3) and rearranging terms we obtain

$$\|\tilde{w}_i\|^2 - \|\tilde{w}_{i-1}\|^2 + \mu|e_a(i)|^2 - \mu|v(i)|^2 = \mu\,|e_a(i) + v(i)|^2 \cdot \left(\mu\|u_i\|^2 - 1\right) \tag{46.4}$$

Adaptive Filters, by Ali H. Sayed

This equality relates the energies of several quantities in an LMS implementation. The
identity is equivalent to the energy-conservation identity that we encountered before in
Thm. 15.1. Nevertheless, the energy relation is now written in a different form, in terms of
$\{e_a(i), v(i)\}$ and not $\{e_a(i), e_p(i)\}$, which is more convenient for our present purposes.
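Identity (46.4) is easy to confirm numerically. The following Python sketch applies one LMS step (46.2) to randomly drawn complex data and evaluates both sides of (46.4). The dimension `M`, the step-size `mu`, the seed, and all variable names are illustrative assumptions and do not come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
M, mu = 4, 0.05  # assumed filter length and step-size

# Unknown weight vector w^o (column), current estimate w_{i-1}, row regressor u_i
w_o = rng.standard_normal(M) + 1j * rng.standard_normal(M)
w_prev = np.zeros(M, dtype=complex)
u = rng.standard_normal(M) + 1j * rng.standard_normal(M)
v = 0.1 * (rng.standard_normal() + 1j * rng.standard_normal())  # disturbance v(i)

d = u @ w_o + v                                     # model (46.1)
w_new = w_prev + mu * u.conj() * (d - u @ w_prev)   # LMS update (46.2)

wt_prev, wt_new = w_o - w_prev, w_o - w_new         # weight-error vectors
e_a = u @ wt_prev                                   # a priori error e_a(i)

# Left- and right-hand sides of the energy relation (46.4)
lhs = (np.linalg.norm(wt_new) ** 2 - np.linalg.norm(wt_prev) ** 2
       + mu * abs(e_a) ** 2 - mu * abs(v) ** 2)
rhs = mu * abs(e_a + v) ** 2 * (mu * np.linalg.norm(u) ** 2 - 1)
print(lhs, rhs)
```

The two printed values should agree up to rounding error, regardless of the sign of $\mu\|u_i\|^2 - 1$.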

To begin with, observe that the right-hand side in (46.4) is the product of three terms.
Two of these terms, $\mu$ and $|e_a(i) + v(i)|^2$, are nonnegative, whereas the last term,
$(\mu\|u_i\|^2 - 1)$, could be positive, negative, or zero depending on the relative sizes of $\mu$ and
$\|u_i\|^2$. It follows that the quantities $\{\tilde{w}_i, \tilde{w}_{i-1}, e_a(i), v(i)\}$ always satisfy the following
inequalities in terms of their squared norms:

$$\begin{aligned}
\|\tilde{w}_i\|^2 + \mu|e_a(i)|^2 &\le \|\tilde{w}_{i-1}\|^2 + \mu|v(i)|^2 && \text{if } \mu\|u_i\|^2 \le 1 \\
\|\tilde{w}_i\|^2 + \mu|e_a(i)|^2 &= \|\tilde{w}_{i-1}\|^2 + \mu|v(i)|^2 && \text{if } \mu\|u_i\|^2 = 1 \\
\|\tilde{w}_i\|^2 + \mu|e_a(i)|^2 &\ge \|\tilde{w}_{i-1}\|^2 + \mu|v(i)|^2 && \text{if } \mu\|u_i\|^2 \ge 1
\end{aligned} \tag{46.5}$$

or, more compactly, assuming a nonzero denominator,

$$\frac{\|\tilde{w}_i\|^2 + \mu|e_a(i)|^2}{\|\tilde{w}_{i-1}\|^2 + \mu|v(i)|^2} \;\begin{cases} \le 1 & \text{if } \mu\|u_i\|^2 \le 1 \\ = 1 & \text{if } \mu\|u_i\|^2 = 1 \\ \ge 1 & \text{if } \mu\|u_i\|^2 \ge 1 \end{cases} \tag{46.6}$$

This representation in the form of a ratio is more compact and is adopted only for convenience
of presentation. Of course, we can avoid the assumption of nonzero denominators
by working with the differences (46.5) rather than the ratio (46.6).

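The first inequality in (46.6), which corresponds to the case $\mu\|u_i\|^2 \le 1$, has an interesting
interpretation. It states that, for any step-size $\mu$ satisfying

$$\mu\|u_i\|^2 \le 1$$

and no matter what $v(i)$ is, we have

$$\mu^{-1}\|\tilde{w}_i\|^2 + |e_a(i)|^2 \le \mu^{-1}\|\tilde{w}_{i-1}\|^2 + |v(i)|^2 \tag{46.7}$$

Summing over $0 \le i \le N$ we get

$$\sum_{i=0}^{N}\left(\mu^{-1}\|\tilde{w}_i\|^2 + |e_a(i)|^2\right) \le \sum_{i=0}^{N}\left(\mu^{-1}\|\tilde{w}_{i-1}\|^2 + |v(i)|^2\right) \tag{46.8}$$

But since several of the terms $\|\tilde{w}_i\|^2$ get cancelled from both sides of the inequality (46.8),
we find that it simplifies to

$$\mu^{-1}\|\tilde{w}_N\|^2 + \sum_{i=0}^{N}|e_a(i)|^2 \le \mu^{-1}\|\tilde{w}_{-1}\|^2 + \sum_{i=0}^{N}|v(i)|^2 \tag{46.9}$$

or, in ratio format again,

$$\frac{\mu^{-1}\|\tilde{w}_N\|^2 + \sum_{i=0}^{N}|e_a(i)|^2}{\mu^{-1}\|\tilde{w}_{-1}\|^2 + \sum_{i=0}^{N}|v(i)|^2} \le 1 \tag{46.10}$$

This relation holds for all $N$ whenever

$$\mu\|u_i\|^2 \le 1 \quad \text{for } 0 \le i \le N \tag{46.11}$$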


We have therefore derived a robustness result for LMS that is consistent with the result of
Alg. 45.4. Indeed, note that Alg. 45.4 indicates, by setting $\hat{s}(j|j-1) = u_jw_{j-1}$, that LMS
satisfies

$$\frac{\sum_{j=0}^{N}|e_a(j)|^2}{\mu^{-1}\|w^o\|^2 + \sum_{j=0}^{N}|v(j)|^2} \le 1 \tag{46.12}$$

where $w^o = \tilde{w}_{-1}$ since $w_{-1} = 0$. Relation (46.10) is a tighter result, with the additional
term $\mu^{-1}\|\tilde{w}_N\|^2$ appearing in the numerator.
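The bound (46.10) can be observed in simulation. The sketch below runs LMS on real-valued data and returns the ratio in (46.10). The helper name `lms_energy_ratio`, the Gaussian regressors and noise, and the choice $\mu = 1/\max_i\|u_i\|^2$ (which enforces (46.11)) are assumptions made for illustration only; the bound itself does not depend on the noise being Gaussian.

```python
import numpy as np

def lms_energy_ratio(N=200, M=5, seed=1, noise_scale=1.0):
    """Run LMS for i = 0..N and return the ratio on the left of (46.10)."""
    rng = np.random.default_rng(seed)
    w_o = rng.standard_normal(M)               # unknown weight vector
    U = rng.standard_normal((N + 1, M))        # row i holds the regressor u_i
    v = noise_scale * rng.standard_normal(N + 1)
    mu = 1.0 / np.max(np.sum(U ** 2, axis=1))  # guarantees mu*||u_i||^2 <= 1
    w = np.zeros(M)                            # w_{-1} = 0, so w~_{-1} = w^o
    num = 0.0
    for i in range(N + 1):
        e_a = U[i] @ (w_o - w)                 # a priori error e_a(i)
        num += e_a ** 2
        d = U[i] @ w_o + v[i]
        w = w + mu * U[i] * (d - U[i] @ w)     # LMS update (46.2)
    num += np.sum((w_o - w) ** 2) / mu         # mu^{-1} ||w~_N||^2
    den = np.sum(w_o ** 2) / mu + np.sum(v ** 2)
    return num / den

print(lms_energy_ratio())
```

Changing `noise_scale` or the seed changes the value of the ratio, but it should never exceed one as long as the step-size condition holds.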


Cauchy-Schwarz Interpretation
Before proceeding to examine, in a similar manner, the robustness of other adaptive filters,
it is of value to step back and to take a closer look at the above robustness analysis of LMS.
In the previous section, we simply squared both sides of the weight-error recursion (46.3)
to conclude that inequality (46.7) holds for any $\mu$ satisfying $\mu\|u_i\|^2 \le 1$. By summing both
sides of (46.7) over $0 \le i \le N$, we were then led to inequality (46.10), which establishes
the robustness of LMS. Given the simplicity of this energy argument, it is worth examining
the origin of inequality (46.7) from first principles.

To do so, let $w_{i-1}$ denote any generic estimate at iteration $i-1$ of some unknown vector
$w^o$. This estimate could have been generated by the LMS filter or by any other filter. Now
given any vector $u_i$, pick an arbitrary positive number $\mu$ satisfying

$$\mu\|u_i\|^2 \le 1 \tag{46.13}$$

Then it always holds that

$$|u_iw^o - u_iw_{i-1}|^2 \le \mu^{-1}\|w^o - w_{i-1}\|^2 \tag{46.14}$$

This is because condition (46.13) and the Cauchy-Schwarz inequality guarantee that$^{24}$

$$|u_iw^o - u_iw_{i-1}|^2 \le \|u_i\|^2 \cdot \|w^o - w_{i-1}\|^2 \le \mu^{-1}\|w^o - w_{i-1}\|^2$$

Note that the quantity on the left-hand side of (46.14) is the energy of the estimation error
$e_a(i) = u_i\tilde{w}_{i-1}$. Likewise, the quantity on the right-hand side of (46.14) is the energy
of the weight-error vector, $\tilde{w}_{i-1}$ (weighted by $\mu^{-1}$). Note further that if the right-hand
side of (46.14) is increased by any nonnegative value, the inequality will continue to hold.
Therefore, if we add $|v(i)|^2$ to the right-hand side of (46.14), for any disturbance
value $v(i)$, then it always holds that

$$|e_a(i)|^2 \le \mu^{-1}\|\tilde{w}_{i-1}\|^2 + |v(i)|^2 \tag{46.15}$$

no matter how $w_{i-1}$ is generated! In other words, every adaptive filter should satisfy this
inequality.


$^{24}$The Cauchy-Schwarz inequality states that for any two column vectors $x$ and $y$, it holds that
$|x^*y| \le \|x\| \cdot \|y\|$.

Lemma 46.1 (Trivial inequality) For any $\mu$ satisfying $\mu\|u_i\|^2 \le 1$, for any
$\{v(i), u_i\}$, and for any estimate $w_{i-1}$ of an unknown vector $w^o$, it always
holds that

$$|e_a(i)|^2 \le \mu^{-1}\|\tilde{w}_{i-1}\|^2 + |v(i)|^2$$

This inequality holds for any adaptive filter.


The natural question is how does inequality (46.15) change when the estimate $w_{i-1}$ is
generated by LMS? That is, how can knowledge of the specific algorithm that generates
$w_{i-1}$ be used to improve the inequality? To see this, assume that $w_{i-1}$ is obtained from the
LMS recursion (46.2). Then we already know from (46.7) that a tighter inequality holds,
namely,

$$\mu^{-1}\|\tilde{w}_i\|^2 + |e_a(i)|^2 \le \mu^{-1}\|\tilde{w}_{i-1}\|^2 + |v(i)|^2 \tag{46.16}$$

with an additional term $\mu^{-1}\|\tilde{w}_i\|^2$ added to the left-hand side, when compared with
(46.15). In other words, although the left-hand side is now larger than before, the inequality
still holds.
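The difference between (46.15) and (46.16) can be illustrated with a small Python sketch. It evaluates the gap in the trivial inequality (46.15) for estimates drawn completely at random, that is, estimates that no adaptive filter produced. The function name `trivial_inequality_gap` and the random test data are hypothetical and used for illustration only.

```python
import numpy as np

def trivial_inequality_gap(u, w_o, w_est, v, mu):
    """Right-hand side minus left-hand side of (46.15).

    The result is nonnegative whenever mu * ||u||^2 <= 1, however w_est was chosen.
    """
    wt = w_o - w_est                 # weight-error vector
    e_a = u @ wt                     # a priori error u_i w~_{i-1}
    return np.linalg.norm(wt) ** 2 / mu + abs(v) ** 2 - abs(e_a) ** 2

rng = np.random.default_rng(0)
gaps = []
for _ in range(1000):
    u = rng.standard_normal(6)
    w_o = rng.standard_normal(6)
    w_est = 10.0 * rng.standard_normal(6)     # arbitrary "estimate"
    v = rng.standard_normal()
    mu = rng.uniform(0.01, 1.0) / (u @ u)     # so that mu * ||u||^2 <= 1
    gaps.append(trivial_inequality_gap(u, w_o, w_est, v, mu))
print(min(gaps))
```

The smallest gap stays nonnegative for every draw, which is the content of Lemma 46.1. What LMS adds is the extra term $\mu^{-1}\|\tilde{w}_i\|^2$ on the left-hand side of (46.16).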


Robustness Interpretation 

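It is also useful to examine the robustness result (46.10) from the perspective of bounded
mappings. Specifically, it turns out that the result (46.10) can be interpreted as bounding
the norm of the mapping induced by LMS between its a priori estimation errors and
disturbances. To see this, introduce the column vectors:

$$\text{dist} = \begin{bmatrix} \mu^{-1/2}\tilde{w}_{-1} \\ v(0) \\ v(1) \\ \vdots \\ v(N) \end{bmatrix} \quad \text{and} \quad \text{error} = \begin{bmatrix} e_a(0) \\ e_a(1) \\ \vdots \\ e_a(N) \\ \mu^{-1/2}\tilde{w}_N \end{bmatrix}$$

The vector dist contains the disturbances that affect the performance of the filter; its energy
is the quantity that appears in the denominator of (46.10). Likewise, the vector error
contains the a priori errors and the final weight-error vector; its energy is the quantity that
appears in the numerator of (46.10). Now the LMS update (46.2) allows us to relate the
entries of both vectors in a straightforward manner. For example, we have

$$e_a(0) = u_0\tilde{w}_{-1} = \left(\mu^{1/2}u_0\right)\left(\mu^{-1/2}\tilde{w}_{-1}\right)$$

which shows how the first entry of error relates to the first entry of dist. Similarly, for
$e_a(1) = u_1\tilde{w}_0$ we obtain

$$e_a(1) = \left(\mu^{1/2}u_1[I - \mu u_0^*u_0]\right)\left(\mu^{-1/2}\tilde{w}_{-1}\right) - \left(\mu u_1u_0^*\right)v(0)$$

which relates $e_a(1)$ to the first two entries of the vector dist. Continuing in this manner,
we can relate $e_a(2)$ to the first three entries of dist, $e_a(3)$ to the first four entries of dist,
and so on. We can express these relations as:

$$\underbrace{\begin{bmatrix} e_a(0) \\ e_a(1) \\ \vdots \\ e_a(N) \\ \mu^{-1/2}\tilde{w}_N \end{bmatrix}}_{\text{error}} = \underbrace{\begin{bmatrix} \times & & & & \\ \times & \times & & & \\ \vdots & & \ddots & & \\ \times & \times & \cdots & \times & \\ \times & \times & \cdots & \times & \times \end{bmatrix}}_{\mathcal{T}} \underbrace{\begin{bmatrix} \mu^{-1/2}\tilde{w}_{-1} \\ v(0) \\ v(1) \\ \vdots \\ v(N) \end{bmatrix}}_{\text{dist}} \tag{46.17}$$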


with a lower-triangular mapping $\mathcal{T}$ relating dist to error. The symbol $\times$ is used to denote
the generic entries of $\mathcal{T}$. The causal nature of the adaptive algorithm results in a lower
triangular mapping $\mathcal{T}$.

The ratio (46.10) can now be seen to be equal to the ratio between the energies of the
output and input vectors of $\mathcal{T}$. However, we know from the discussion in Sec. B.6 that the
maximum singular value of $\mathcal{T}$ admits the interpretation:

$$\sigma_{\max}(\mathcal{T}) = \max_{x \ne 0} \frac{\|\mathcal{T}x\|}{\|x\|}$$

In other words, $\sigma_{\max}(\mathcal{T})$ measures the maximum energy gain from the input vector $x$ to the
resulting vector $\mathcal{T}x$. It then follows that relation (46.10) amounts to saying that

$$\max_{\text{dist} \ne 0} \frac{\|\mathcal{T}\,\text{dist}\|}{\|\text{dist}\|} \le 1$$

so that the maximum singular value of $\mathcal{T}$ must be bounded by 1. We therefore say that the
robustness of LMS guarantees that the singular values of the mapping $\mathcal{T}$ between dist and
error are bounded by unity.
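The mapping $\mathcal{T}$ of (46.17) can be formed explicitly for a given set of regressors. Because the map from dist to error is linear once the regressors are fixed, the Python sketch below builds $\mathcal{T}$ column by column by pushing unit vectors through the weight-error recursion (46.3). The function name `lms_error_map`, the restriction to real-valued data, and the random regressors are assumptions made for illustration.

```python
import numpy as np

def lms_error_map(U, mu):
    """Return the matrix T with error = T @ dist, as in (46.17), for real data.

    U has one regressor u_i per row (i = 0..N). The vector dist stacks
    mu^{-1/2} w~_{-1} (M entries) and v(0..N); the vector error stacks
    e_a(0..N) and mu^{-1/2} w~_N (M entries).
    """
    n, M = U.shape
    T = np.zeros((n + M, M + n))
    for k in range(M + n):
        dist = np.zeros(M + n)
        dist[k] = 1.0                            # k-th unit disturbance
        wt = np.sqrt(mu) * dist[:M]              # w~_{-1}
        v = dist[M:]
        for i in range(n):
            e_a = U[i] @ wt
            T[i, k] = e_a
            wt = wt - mu * U[i] * (e_a + v[i])   # recursion (46.3)
        T[n:, k] = wt / np.sqrt(mu)              # mu^{-1/2} w~_N
    return T

rng = np.random.default_rng(0)
U = rng.standard_normal((30, 3))
mu = 1.0 / np.max(np.sum(U ** 2, axis=1))        # enforces (46.11)
T = lms_error_map(U, mu)
sigma_max = np.linalg.svd(T, compute_uv=False).max()
print(sigma_max)
```

With the step-size chosen to satisfy (46.11), the printed maximum singular value does not exceed one. The zero pattern of `T` also reflects causality: $e_a(i)$ does not depend on $v(j)$ for $j \ge i$.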


46.2 ROBUSTNESS OF $\epsilon$-NLMS


We now examine the robustness of other adaptive filters, by employing energy arguments
similar to what we have just done for LMS.
Consider the $\epsilon$-NLMS recursion with unit step-size,

$$w_i = w_{i-1} + \frac{u_i^*}{\epsilon + \|u_i\|^2}[d(i) - u_iw_{i-1}] \tag{46.18}$$

Subtracting both sides from $w^o$ we find

$$\tilde{w}_i = \tilde{w}_{i-1} - \frac{u_i^*}{\epsilon + \|u_i\|^2}\left(e_a(i) + v(i)\right)$$

Multiplying both sides by $u_i$ from the left allows us to relate $\{e_p(i), e_a(i)\}$ as follows:

$$e_p(i) = e_a(i) - \frac{\|u_i\|^2}{\epsilon + \|u_i\|^2}\left(e_a(i) + v(i)\right)$$

so that

$$e_a(i) = \frac{\epsilon + \|u_i\|^2}{\epsilon}\,e_p(i) + \frac{\|u_i\|^2}{\epsilon}\,v(i)$$

Using this expression for $e_a(i)$ we get

$$d(i) - u_iw_{i-1} = e_a(i) + v(i) = \frac{\epsilon + \|u_i\|^2}{\epsilon}[e_p(i) + v(i)] = \frac{\epsilon + \|u_i\|^2}{\epsilon}[d(i) - u_iw_i]$$

and substituting back into (46.18) we conclude that the $\epsilon$-NLMS recursion can be rewritten
in the equivalent form

$$w_{i-1} = w_i - \epsilon^{-1}u_i^*[d(i) - u_iw_i] \tag{46.19}$$

with $w_i$ appearing on the right-hand side as well. This form looks similar to the LMS
update (46.2), so that most of the derivations from Sec. 46.1 should extend almost literally
to the present case.

Indeed, subtracting both sides of (46.19) from $w^o$ now leads to

$$\tilde{w}_{i-1} = \tilde{w}_i + \epsilon^{-1}u_i^*[e_p(i) + v(i)] \tag{46.20}$$

with $e_p(i)$ on the right-hand side, as opposed to $e_a(i)$ as in the LMS case (46.3). After
squaring both sides of (46.20) we get

$$\|\tilde{w}_{i-1}\|^2 - \|\tilde{w}_i\|^2 - \epsilon^{-1}|e_p(i)|^2 + \epsilon^{-1}|v(i)|^2 = \epsilon^{-1}\,|e_p(i) + v(i)|^2 \cdot \left(\epsilon^{-1}\|u_i\|^2 + 1\right)$$

where the right-hand side is now seen to be always nonnegative! Therefore, it always holds
that

$$\epsilon\|\tilde{w}_i\|^2 + |e_p(i)|^2 \le \epsilon\|\tilde{w}_{i-1}\|^2 + |v(i)|^2$$

Summing over $0 \le i \le N$ we get, in ratio form,

$$\frac{\epsilon\|\tilde{w}_N\|^2 + \sum_{i=0}^{N}|e_p(i)|^2}{\epsilon\|\tilde{w}_{-1}\|^2 + \sum_{i=0}^{N}|v(i)|^2} \le 1 \tag{46.21}$$

This relation holds for all $N$. In this way, we have derived a robustness result for $\epsilon$-NLMS
that is consistent with the result from Alg. 45.2. Recall from the statement of Alg. 45.2, by
using $\hat{s}(j|j) = u_jw_j$, that $\epsilon$-NLMS satisfies

$$\frac{\sum_{j=0}^{N}|e_p(j)|^2}{\epsilon\|w^o\|^2 + \sum_{j=0}^{N}|v(j)|^2} \le 1$$

where $w^o = \tilde{w}_{-1}$ since $w_{-1} = 0$. Relation (46.21) provides a tighter bound with the
additional term $\epsilon\|\tilde{w}_N\|^2$ appearing in the numerator.
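Both the equivalent form (46.19) and the bound (46.21) can be checked in simulation. The Python sketch below runs $\epsilon$-NLMS on real-valued data, verifies at each step that (46.19) recovers $w_{i-1}$ from $w_i$, and accumulates the ratio in (46.21). The function name `nlms_check` and the parameter values are illustrative assumptions.

```python
import numpy as np

def nlms_check(N=200, M=5, eps=0.5, seed=2):
    """Return (ratio of (46.21), largest deviation found when testing (46.19))."""
    rng = np.random.default_rng(seed)
    w_o = rng.standard_normal(M)
    w = np.zeros(M)                              # w_{-1} = 0, so w~_{-1} = w^o
    num = 0.0
    den = eps * np.sum(w_o ** 2)
    max_dev = 0.0
    for i in range(N + 1):
        u = rng.standard_normal(M)
        v = rng.standard_normal()
        d = u @ w_o + v
        w_new = w + u * (d - u @ w) / (eps + u @ u)   # update (46.18)
        back = w_new - u * (d - u @ w_new) / eps      # form (46.19), should equal w
        max_dev = max(max_dev, float(np.max(np.abs(back - w))))
        num += (u @ (w_o - w_new)) ** 2               # |e_p(i)|^2
        den += v ** 2
        w = w_new
    num += eps * np.sum((w_o - w) ** 2)               # eps * ||w~_N||^2
    return num / den, max_dev

ratio, dev = nlms_check()
print(ratio, dev)
```

Unlike the LMS case, no step-size condition is needed here: the ratio stays below one for any regressors and any $\epsilon > 0$.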


46.3 ROBUSTNESS OF RLS


We now examine the robustness of the RLS filter as stated in Alg. 30.1, namely,

$$w_i = w_{i-1} + \frac{P_{i-1}u_i^*}{1 + u_iP_{i-1}u_i^*}[d(i) - u_iw_{i-1}]$$

where, in addition, we know from (30.10) that

$$P_i^{-1} = P_{i-1}^{-1} + u_i^*u_i \tag{46.22}$$

Using a derivation similar to the one used above for $\epsilon$-NLMS, it can be verified that the
update equation for RLS can be rewritten in the equivalent form (see Prob. XI.7):

$$w_{i-1} = w_i - P_{i-1}u_i^*[d(i) - u_iw_i]$$

Subtracting both sides from $w^o$ leads to

$$\tilde{w}_{i-1} = \tilde{w}_i + P_{i-1}u_i^*[e_p(i) + v(i)] \tag{46.23}$$

so that computing the weighted norm of both sides, by using $P_{i-1}^{-1}$ as a weighting matrix,
we get

$$\tilde{w}_{i-1}^*P_{i-1}^{-1}\tilde{w}_{i-1} = \left(\tilde{w}_i + P_{i-1}u_i^*[e_p(i) + v(i)]\right)^* P_{i-1}^{-1} \left(\tilde{w}_i + P_{i-1}u_i^*[e_p(i) + v(i)]\right)$$

which, upon expansion and using (46.22), provides the equality

$$\begin{aligned}
&\tilde{w}_{i-1}^*P_{i-1}^{-1}\tilde{w}_{i-1} - \tilde{w}_i^*P_i^{-1}\tilde{w}_i - |e_p(i)|^2 + |v(i)|^2 \\
&\qquad = \left(e_p(i) + v(i)\right)^*\left[u_iP_{i-1}u_i^* + 1\right]\left(e_p(i) + v(i)\right) - |e_p(i)|^2
\end{aligned} \tag{46.24}$$


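It is observed now that the right-hand side in the above equality is not guaranteed to be
nonnegative, in general. However, observe the following. For any $\{e_p(i), v(i)\}$ it holds
that

$$|e_p(i) + v(i)|^2 \ge \tfrac{1}{2}|e_p(i)|^2 - |v(i)|^2$$

as can be easily checked from the trivial inequality

$$\left|\tfrac{1}{\sqrt{2}}\,e_p(i) + \sqrt{2}\,v(i)\right|^2 \ge 0$$

Therefore, adding $\tfrac{1}{2}|e_p(i)|^2 + |v(i)|^2$ to both sides of (46.24) we get

$$\begin{aligned}
&\tilde{w}_{i-1}^*P_{i-1}^{-1}\tilde{w}_{i-1} - \tilde{w}_i^*P_i^{-1}\tilde{w}_i - \tfrac{1}{2}|e_p(i)|^2 + 2|v(i)|^2 \\
&\qquad = \left(e_p(i) + v(i)\right)^*\left[u_iP_{i-1}u_i^* + 1\right]\left(e_p(i) + v(i)\right) - \left(\tfrac{1}{2}|e_p(i)|^2 - |v(i)|^2\right)
\end{aligned}$$

where the right-hand side is now guaranteed to be nonnegative since

$$\begin{aligned}
\left(e_p(i) + v(i)\right)^*\left[u_iP_{i-1}u_i^* + 1\right]\left(e_p(i) + v(i)\right)
&= \left(e_p(i) + v(i)\right)^*\left(u_iP_{i-1}u_i^*\right)\left(e_p(i) + v(i)\right) + |e_p(i) + v(i)|^2 \\
&\ge \left(e_p(i) + v(i)\right)^*\left(u_iP_{i-1}u_i^*\right)\left(e_p(i) + v(i)\right) + \tfrac{1}{2}|e_p(i)|^2 - |v(i)|^2 \\
&\ge \tfrac{1}{2}|e_p(i)|^2 - |v(i)|^2
\end{aligned}$$

where the last step uses the fact that $u_iP_{i-1}u_i^* \ge 0$ because $P_{i-1}$ is positive-definite.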


It follows that

$$\tilde{w}_i^*P_i^{-1}\tilde{w}_i + \tfrac{1}{2}|e_p(i)|^2 \le \tilde{w}_{i-1}^*P_{i-1}^{-1}\tilde{w}_{i-1} + 2|v(i)|^2$$

Summing over $0 \le i \le N$ we get, in ratio form (recall that $P_{-1} = \Pi^{-1}$),

$$\frac{\tilde{w}_N^*P_N^{-1}\tilde{w}_N + \frac{1}{2}\sum_{i=0}^{N}|e_p(i)|^2}{\tilde{w}_{-1}^*\Pi\tilde{w}_{-1} + 2\sum_{i=0}^{N}|v(i)|^2} \le 1 \tag{46.25}$$

This relation holds for all $N$. In this way, we arrive at a robustness result for RLS as well.
This result can be reworked into a more familiar form as follows. The inequality (46.25)
would still hold if we increase the denominator by adding $\tilde{w}_{-1}^*\Pi\tilde{w}_{-1}$ to it, so that

$$\frac{\tilde{w}_N^*P_N^{-1}\tilde{w}_N + \frac{1}{2}\sum_{i=0}^{N}|e_p(i)|^2}{2\tilde{w}_{-1}^*\Pi\tilde{w}_{-1} + 2\sum_{i=0}^{N}|v(i)|^2} \le 1$$

from which we conclude that

$$\frac{\sum_{i=0}^{N}|e_p(i)|^2}{\tilde{w}_{-1}^*\Pi\tilde{w}_{-1} + \sum_{i=0}^{N}|v(i)|^2} \le 4 \tag{46.26}$$

The result (46.25) gives a tighter bound than (46.26). This last inequality shows that for
RLS, the ratio of the energies from disturbances $\{\tilde{w}_{-1}, v(\cdot)\}$ to estimation errors $\{e_p(\cdot)\}$
never exceeds 4. In Prob. XI.17, a similar bound is derived using the a priori estimation
errors $\{e_a(\cdot)\}$; this latter bound, however, will be data dependent.
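The RLS bounds (46.25) and (46.26) can be observed in the same way. The Python sketch below runs RLS without forgetting on real-valued data, with $P_{-1} = \Pi^{-1}$ and $\Pi = \delta I$, and returns both ratios. The function name `rls_ratios`, the choice $\Pi = \delta I$, and all parameter values are assumptions made for illustration.

```python
import numpy as np

def rls_ratios(N=200, M=4, delta=0.1, seed=3):
    """Return the ratios appearing in (46.25) and (46.26) for one RLS run."""
    rng = np.random.default_rng(seed)
    w_o = rng.standard_normal(M)
    Pi = delta * np.eye(M)
    P = np.linalg.inv(Pi)                 # P_{-1} = Pi^{-1}
    w = np.zeros(M)                       # w_{-1} = 0, so w~_{-1} = w^o
    sum_ep, sum_v = 0.0, 0.0
    for i in range(N + 1):
        u = rng.standard_normal(M)
        v = rng.standard_normal()
        d = u @ w_o + v
        Pu = P @ u
        g = Pu / (1.0 + u @ Pu)           # gain vector
        w = w + g * (d - u @ w)           # RLS weight update
        P = P - np.outer(g, Pu)           # equivalent to (46.22) in inverse form
        sum_ep += (u @ (w_o - w)) ** 2    # |e_p(i)|^2
        sum_v += v ** 2
    wt = w_o - w
    base = w_o @ Pi @ w_o                 # w~_{-1}^* Pi w~_{-1}
    r25 = (wt @ np.linalg.solve(P, wt) + 0.5 * sum_ep) / (base + 2.0 * sum_v)
    r26 = sum_ep / (base + sum_v)
    return r25, r26

r25, r26 = rls_ratios()
print(r25, r26)
```

For typical random data both ratios are far below their bounds of 1 and 4. The bounds describe worst-case disturbances rather than average behavior.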


Summary and Notes 


The chapters in this part describe a notion of robustness and a procedure for designing
robust filters by relying on indefinite least-squares theory.


SUMMARY OF MAIN RESULTS 


1. A recursive procedure similar to RLS is derived for time-updating the solution of an indefinite 
least-squares problem. Algorithm 44.1 does so for the general case in which the observation 
data and the regression data are allowed to be vectors and matrices, respectively. In contrast 
to the block RLS solution of Prob. VII.36, for example, we now require the successive $P_i$ to
be positive-definite in order for the successive weight vectors to correspond to global minima. 


2. There are many notions of robustness. The one described in Chapter 45 requires that we bound 
the energy gain from disturbances to estimation errors at each iteration. This formulation 
reduces the design of robust filters to the solution of indefinite least-squares problems. 


3. Two kinds of robust filters are designed: a posteriori filters and a priori filters. It turns out
that $\epsilon$-NLMS is a special case of the former while LMS is a special case of the latter. It is
further argued in the text and in Prob. XI.5 that LMS and $\epsilon$-NLMS are optimal robust filters.


4. Chapter 46 develops a framework for robustness analysis by using the same energy-conservation
arguments that we employed earlier in Parts IV (Mean-Square Performance) and V
(Transient Performance). This framework allows us to examine the robustness of a larger
class of adaptive filters (e.g., RLS, filtered-error LMS, Perceptron, etc.), and to bring forth
connections with some system-theoretic concepts such as passivity relations, small gain conditions,
and $l_2$-stability; see, e.g., Probs. XI.6-XI.12.


BIBLIOGRAPHIC NOTES 


Derivation of robust filters. The a posteriori and a priori robust adaptive filters studied in
Secs. 45.1 and 45.3 are special cases of a broader family of robust filters known as $\mathcal{H}^\infty$ filters (in
much the same way as RLS itself is a special case of the Kalman filter, as was shown in App. 31.2).
An overview of $\mathcal{H}^\infty$ filters is given in App. 45.A. A more detailed treatment of $\mathcal{H}^\infty$ filters, as well
as $\mathcal{H}^\infty$ controllers, can be found in the textbook by Green and Limebeer (1995) and in the monograph
by Hassibi, Sayed, and Kailath (1999). Several approaches can be used to derive $\mathcal{H}^\infty$ filters,
especially for general state-space models, such as completion-of-squares arguments, game-theoretic
arguments, and Krein space arguments. In Secs. 45.1 and 45.3 we chose instead to follow the presentation
of the least-squares theory in Chapter 30. In this way, readers will be able to realize more
immediately the connections between robust adaptive filters and least-squares adaptive filters. Further
discussions on the connections between robust adaptive filters and $\mathcal{H}^\infty$ filters can be found in
Hassibi, Sayed, and Kailath (1996,1999), Sayed, Hassibi, and Kailath (1996a), and Sayed and Rupp
(1997).

$\mathcal{H}^\infty$ framework. Research on $\mathcal{H}^\infty$ designs was pursued rather systematically during the 1980s
and 1990s. Most works in this area were primarily concerned with the analysis and design of robust
filters and controllers; robust in the sense that they limit the effect of modeling uncertainties on the
performance and stability of the resulting filters and controllers. The original work in the field is that
of Zames (1981) on sensitivity issues in feedback control. Since then many ingenious approaches and
viewpoints have been put forward with applications over a wide range of areas. See, for example, the
books by Green and Limebeer (1995), Basar and Bernhard (1995), Dahleh and Diaz-Bobillo (1995),
Zhou, Doyle, and Glover (1996), and Helton and Merlino (1998), which focus mostly on control
problems, and the monograph by Hassibi, Sayed, and Kailath (1999), which takes an estimation
perspective to the solution of both filtering and control problems. Some earlier references are the
articles by Khargonekar and Nagpal (1991), Yaesh and Shaked (1991), and Shaked and Theodor
(1992), which deal with filtering issues as well.


$\mathcal{H}^\infty$-optimality of LMS. As we saw in Chapter 10, the LMS algorithm is usually derived as a
stochastic-gradient approximation to a steepest-descent method. In Sec. 45.4, we argued that LMS
can be derived as an optimal solution to a well-defined robustness criterion (cf. the statement of
Alg. 45.4). This result was established by Hassibi, Sayed, and Kailath (1993,1996) using connections
with $\mathcal{H}^\infty$ filtering theory (cf. App. 45.A). Extensions of this conclusion to the backpropagation
algorithm for neural network training can also be found in Hassibi, Sayed, and Kailath (1994,1999).
Further extensions to time-variant step-sizes, Gauss-Newton recursions, and filtered-error algorithms
can be found in Sayed and Rupp (1994,1996,1997,1998) and Rupp and Sayed (1996ab).


Energy conservation. The energy-conservation relation (46.4) (or, equivalently, (15.32)) is due
to Sayed and Rupp (1995). It was used by the authors in a series of works, including Rupp and Sayed
(1996ab,1997,1998,2000) and Sayed and Rupp (1996,1997,1998), in their studies on the robustness
and small gain analysis of adaptive filters. Several of their robustness results are discussed in the
problems at the end of this part. The energy arguments of Chapter 46 are due to Sayed and Rupp
(1994,1996), as is the Cauchy-Schwarz argument of Sec. 46.1; this latter argument also appears
in Sayed and Kailath (1994b). An extension of the energy conservation argument of Chapter 46 to
continuous-time LMS appears in Sayed and Kokotovic (1995).


Comparison of RLS and LMS. It is instructive to compare the robustness performance of LMS
and RLS for the case of single-tap adaptive filters (in which case the regressor is a scalar). The
following example is extracted from Hassibi, Sayed and Kailath (1996). Assume that the regression
signal randomly assumes the values $\pm 1$ with probability 1/2 and let $w^o = 0.25$ be the unknown
weight that we wish to estimate. The noise signal is assumed to be zero-mean and Gaussian. We first
employ LMS, i.e.,

$$w(i) = w(i-1) + \mu u(i)[d(i) - u(i)w(i-1)]$$

and compute the initial $N = 100$ weight estimates starting from $w(-1) = 0$ and using $\mu = 0.97$.
We also evaluate the entries of the resulting mapping $\mathcal{T}$ from (46.17), which we now denote by $\mathcal{T}_{lms}$.
[Observe that the chosen value for $\mu$ satisfies the requirement $\mu|u(i)|^2 \le 1$ for all $i$.]
We then employ RLS, i.e.,

$$w(i) = w(i-1) + \frac{p(i-1)u(i)}{1 + p(i-1)}[d(i) - u(i)w(i-1)], \qquad p(i) = \frac{p(i-1)}{1 + p(i-1)}$$

with initial conditions $p(-1) = \mu$ and $w(-1) = 0$. We again compute the initial 100 weight
estimates using $\mu = 0.97$ and evaluate the entries of the corresponding mapping $\mathcal{T}$, now denoted by
$\mathcal{T}_{rls}$ (in this case, it can be verified that $p(i) = \mu/(1 + (i+1)\mu)$ for all $i \ge -1$).

Figure XI.1 shows a plot of the 100 singular values of the mappings $\mathcal{T}_{lms}$ and $\mathcal{T}_{rls}$. As expected
from the robustness analysis in the chapter, we find that the singular values of $\mathcal{T}_{lms}$ (indicated by
an almost horizontal line at 1) are all bounded by one, whereas the maximum singular value of $\mathcal{T}_{rls}$
is approximately 1.65. Observe, however, that most of the singular values of $\mathcal{T}_{rls}$ are considerably
smaller than one while the singular values of $\mathcal{T}_{lms}$ are clustered around one. This fact has an interesting
interpretation. If the disturbance vector dist happens to lie in the range space of the right singular
vectors of $\mathcal{T}_{rls}$ that are associated with the smaller singular values, then its effect will be significantly
attenuated by RLS. This fact indicates that while LMS has the best worst-case performance (in the
sense that it guards against worst-case disturbances), the RLS algorithm is expected to have a better
performance on average.

FIGURE XI.1 Plot of the first 100 singular values associated with the LMS and RLS mappings.
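The single-tap comparison can be reproduced along the following lines. The Python sketch below builds $100 \times 100$ matrices mapping $\{\mu^{-1/2}\tilde{w}(-1), v(0), \ldots, v(N-2)\}$ to $\{e_a(0), \ldots, e_a(N-1)\}$ for both filters and computes their singular values. Using a square mapping of this form is an assumption about how the 100 singular values in the example were obtained; the function name `single_tap_maps` and the seed are also illustrative. Since the mappings do not depend on $w^o$ or on the noise realization, only the regressor signs are drawn at random.

```python
import numpy as np

def single_tap_maps(N=100, mu=0.97, seed=0):
    """Return (T_lms, T_rls) for scalar regressors u(i) = +/-1."""
    rng = np.random.default_rng(seed)
    u = rng.choice([-1.0, 1.0], size=N)
    T_lms = np.zeros((N, N))
    T_rls = np.zeros((N, N))
    for k in range(N):
        dist = np.zeros(N)
        dist[k] = 1.0                        # k-th unit disturbance
        wt_l = wt_r = np.sqrt(mu) * dist[0]  # w~(-1) for both filters
        p = mu                               # p(-1) = mu
        for i in range(N):
            T_lms[i, k] = u[i] * wt_l        # e_a(i) under LMS
            T_rls[i, k] = u[i] * wt_r        # e_a(i) under RLS
            if i + 1 < N:
                v = dist[i + 1]              # v(i)
                wt_l -= mu * u[i] * (u[i] * wt_l + v)
                wt_r -= p * u[i] / (1.0 + p) * (u[i] * wt_r + v)
                p = p / (1.0 + p)
    return T_lms, T_rls

T_lms, T_rls = single_tap_maps()
s_lms = np.linalg.svd(T_lms, compute_uv=False)
s_rls = np.linalg.svd(T_rls, compute_uv=False)
print(s_lms.max(), s_rls.max(), np.median(s_rls))
```

In this construction the largest singular value of the LMS mapping does not exceed one, while the RLS mapping has a largest singular value above one and a median well below one, in line with the worst-case versus average-case reading given above.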


Feedback structure. In Fig. 15.5 we showed that an intrinsic feedback structure can be associated
with every adaptive filter update in the energy domain (cf. Sayed and Rupp (1995) and Rupp
and Sayed (1996a)). The feedback structure was motivated in these references, and in Chapter 15,
using energy arguments. It is seen to consist of two distinctive blocks: a time-variant lossless (i.e.,
energy preserving) feedforward mapping and a time-variant feedback path. In Probs. XI.9-XI.13 we
examine this configuration more closely and show that it lends itself to analysis via a so-called small
gain theorem from system theory (see, e.g., Vidyasagar (1993) and Khalil (1996)). Such analyses
lead to stability and robustness conditions that require the contractivity of certain operators. Results
along these lines can be found in Rupp and Sayed (1996a,1997) and Sayed and Rupp (1997).

There have been several earlier interesting works in the literature on feedback structures for adaptive
filtering, e.g., by Ljung (1977b) and Landau (1979,1984), as well as by Popov (1973) on hyperstability
or passivity results; see also Anderson et al. (1986). These earlier works were of a different
nature than the structure of Fig. 15.5; they were concerned with the fact that the update equations
of adaptive filters can be represented in a recursive manner. The feedback structure of Fig. 15.5 is
instead concerned with energy propagation and, among other features, it enforces a lossless mapping
in the feedforward path and allows for a time-variant mapping in the feedback (in comparison,
hyperstability analyses require one of the paths to be time-invariant; see Landau (1979, p. 381)).


TABLE XI.1 Three FxLMS variants.


| Algorithm | Complexity | Memory | Convergence |
FxLMS 


Bjarnason (1992), Kim et al. (1994) 
Rupp and Sayed (1998) 


Robustness of FxLMS and active noise control. A widely used algorithm in active noise control
is the so-called filtered-x least-mean-squares (FxLMS) algorithm (see, e.g., Widrow and Stearns
(1985), Sethares, Anderson and Johnson (1989), and Elliott and Nelson (1993)). The algorithm is
discussed in the computer project at the end of this part. In its standard form, the algorithm has
LMS complexity, i.e., it requires $2M$ operations per sample. However, it exhibits poor convergence
performance. Modifications, referred to as modified FxLMS, were proposed by Bjarnason (1992),
Kim et al. (1994), and Rupp and Sayed (1998), in order to ameliorate the convergence problem;
see Table XI.1, which compares the performance of these variants in terms of complexity, memory
requirement, and convergence. In the table, $M$ is the length of the adaptive filter and $M_F$ is the
length of the error filter (as described in the computer project).


Problems and Computer Projects 


PROBLEMS 
Problem XI.1 (Indefinite quadratic cost) Consider the quadratic cost function (44.1) and choose
$\Pi = I$, $H = I$, $W = (0 \oplus -2)$, $y = \text{col}\{\alpha, \beta\}$.

(a) What is the signature of the coefficient matrix $\Pi + H^*WH$?


(b) Verify that the gradient vector of $J(w)$ evaluates to zero at $\hat{w} = \text{col}\{0, 2\beta\}$, and that $J(\hat{w}) = 2|\beta|^2$. Show that $\hat{w}$ is the only vector where the gradient of $J(w)$ is zero.


(c) Are there any other choices for $w$ at which $J(w)$ evaluates to $2|\beta|^2$?


(d) Plot the contour curves of $J(w)$.


Problem XI.2 (Negative-definite coefficient) Refer to (45.19).
(a) Conclude from (45.20) that the matrices $(A^{-1} \oplus (1 + u_iP_{i-1}u_i^*)^{-1})$ and $T_i$ are congruent.
(b) Use the existence condition (45.15) to conclude that $A < 0$.


Problem XI.3 (Robust formulation with weighted noise energy) Refer to the discussion in
Sec. 45.1. Assume again that we are given measurements $d(i) = u_iw^o + v(i)$ and that we wish to
estimate $s(i) = u_iw^o$ in order to satisfy the criterion

$$\frac{\sum_{j=0}^{i}|\hat{s}(j|j) - s(j)|^2}{w^{o*}\Pi w^o + \alpha\sum_{j=0}^{i}|v(j)|^2} < \gamma^2 \quad \text{for all } i = 0, 1, \ldots, N$$

for some $\alpha > 0$. Repeat the arguments of that section to show that one construction for the desired
estimates is as follows. Start with $w_{-1} = 0$ and $P_{-1} = \Pi^{-1}$ and iterate:

$$w_i = w_{i-1} + \frac{P_{i-1}u_i^*}{\alpha^{-1} + u_iP_{i-1}u_i^*}[d(i) - u_iw_{i-1}], \qquad \hat{s}(i|i) = u_iw_i$$

$$P_i = P_{i-1} - \frac{P_{i-1}u_i^*u_iP_{i-1}}{(\alpha - \gamma^{-2})^{-1} + u_iP_{i-1}u_i^*}$$

Argue that this solution satisfies the above robustness criterion for all $\gamma > 1/\sqrt{\alpha}$.


Problem XI.4 (Non-sufficiently exciting regressors) Assume that condition (45.25) does not
hold so that $\lim_{N \to \infty}\left(\sum_{i=0}^{N} u_i^*u_i\right) \le \rho I$, for some $\rho < \infty$. Choose $\Pi = \epsilon I$, $\epsilon > 0$. Use (45.26)
to argue that $P_i > 0$ for any $\gamma$ satisfying $\gamma^2 > \rho/(\epsilon + \rho)$.


Problem XI.5 (Min-max interpretation) Refer to the discussion in Sec. 45.2 and assume the
regressors are sufficiently exciting as in (45.25).

(a) Argue that, for any $\gamma > 1$, the estimates $\{\hat{s}(j|j)\}$ that are constructed according to the $\epsilon$-NLMS
algorithm guarantee the following bound, for all finite-energy noise sequences $\{v(i)\}$:

$$\sup_{\{w^o,\, v(\cdot)\}} \frac{\sum_{j=0}^{\infty}|\hat{s}(j|j) - s(j)|^2}{\epsilon\|w^o\|^2 + \sum_{j=0}^{\infty}|v(j)|^2} < \gamma^2$$

(b) Conclude that the $\epsilon$-NLMS recursion of Alg. 45.2 is a solution to the following min-max
problem over all finite-energy noise sequences $\{v(i)\}$:

$$\gamma_{opt}^2 = \inf_{\{\hat{s}(j|j)\}}\; \sup_{\{w^o,\, v(\cdot)\}} \frac{\sum_{j=0}^{\infty}|\hat{s}(j|j) - s(j)|^2}{\epsilon\|w^o\|^2 + \sum_{j=0}^{\infty}|v(j)|^2}$$

Argue further that $\gamma_{opt}^2 = 1$.

(c) Repeat the derivation for the LMS algorithm of Sec. 45.4. More specifically, show that LMS
is a solution to the following min-max problem over all finite-energy noise sequences $\{v(i)\}$:

$$\gamma_{opt}^2 = \inf_{\{\hat{s}(j|j-1)\}}\; \sup_{\{w^o,\, v(\cdot)\}} \frac{\sum_{j=0}^{\infty}|\hat{s}(j|j-1) - s(j)|^2}{\mu^{-1}\|w^o\|^2 + \sum_{j=0}^{\infty}|v(j)|^2}$$

Argue further that $\gamma_{opt} = 1$.


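Problem XI.6 (Passivity relation) Consider LMS-type updates with time-variant step-sizes of
the form $w_i = w_{i-1} + \mu(i)u_i^*[d(i) - u_iw_{i-1}]$ with $\mu(i) > 0$.


(a) Show that

$$\begin{aligned}
\|\tilde{w}_i\|^2 + \mu(i)|e_a(i)|^2 &\le \|\tilde{w}_{i-1}\|^2 + \mu(i)|v(i)|^2 && \text{if } \mu(i)\|u_i\|^2 \le 1 \\
\|\tilde{w}_i\|^2 + \mu(i)|e_a(i)|^2 &= \|\tilde{w}_{i-1}\|^2 + \mu(i)|v(i)|^2 && \text{if } \mu(i)\|u_i\|^2 = 1 \\
\|\tilde{w}_i\|^2 + \mu(i)|e_a(i)|^2 &\ge \|\tilde{w}_{i-1}\|^2 + \mu(i)|v(i)|^2 && \text{if } \mu(i)\|u_i\|^2 \ge 1
\end{aligned}$$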


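(b) Define the normalized quantities $\bar{e}_a(i) = \sqrt{\mu(i)}\,e_a(i)$ and $\bar{v}(i) = \sqrt{\mu(i)}\,v(i)$ and assume
$\mu(i)\|u_i\|^2 \le 1$. Conclude that

$$\frac{\|\tilde{w}_N\|^2 + \sum_{i=0}^{N}|\bar{e}_a(i)|^2}{\|\tilde{w}_{-1}\|^2 + \sum_{i=0}^{N}|\bar{v}(i)|^2} \le 1$$

(c) What does the relation of part (b) collapse to when $\mu(i) = \mu$?
(d) What does the relation of part (b) collapse to when $\mu(i) = 1/(\epsilon + \|u_i\|^2)$?
Remark. For more details see Sayed and Rupp (1996).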


Problem XI.7 (RLS recursion) Consider the RLS recursion of Sec. 46.3. Follow the argument
of Sec. 46.2 to show that the recursion for $w_i$ can be rewritten in the equivalent form
$w_{i-1} = w_i - P_{i-1}u_i^*[d(i) - u_iw_i]$.


Problem XI.8 (Convergence result) Consider the LMS recursion

$$w_i = w_{i-1} + \mu u_i^*[d(i) - u_iw_{i-1}], \qquad w_{-1} = 0$$

and assume the regressors $\{u_i\}$ are bounded and that the step-size $\mu$ is positive and bounded from
above, say $\mu < c < \inf_i\,(1/\|u_i\|^2)$. Assume further that the data $\{d(i), u_i\}$ satisfy $d(i) = u_iw^o + v(i)$
for some unknown $M \times 1$ bounded vector $w^o$, and for some unknown noise sequence $v(i)$ with
finite energy, i.e., $\lim_{N \to \infty}\sum_{i=0}^{N}|v(i)|^2 < \infty$.

(a) Use the contraction relation (46.9) to conclude that $e_a(i) \to 0$ as $i \to \infty$.


(b) Assume the regressors are persistently exciting, namely, there exists a finite integer $L \ge M$
such that the matrix $\text{col}\{u_i, u_{i+1}, \ldots, u_{i+L}\}$ has full rank for sufficiently large $i$. Conclude
from part (a) that $w_i \to w^o$ as $i \to \infty$.


(c) Assume instead that the noise sequence has finite power, as opposed to finite energy, i.e.,

$$\lim_{N \to \infty}\left(\frac{1}{N}\sum_{i=0}^{N}|v(i)|^2\right) < \infty$$

Show that the sequence $\{e_a(i)\}$ will also have finite power.


Problem XI.9 (Small gain condition) Consider the feedback structure of Fig. XI.2. It has a
finite-dimensional lossless mapping $\mathcal{T}$ in the feedforward path, i.e., $\mathcal{T}$ satisfies $\mathcal{T}\mathcal{T}^* = \mathcal{T}^*\mathcal{T} = I$. It also
has an arbitrary finite-dimensional mapping $\mathcal{F}$ in the feedback path. The input and output signals of
interest are denoted by $\{x, y, r, v, e\}$. In this system, the signals $x, v$ play the role of disturbances.

FIGURE XI.2 A feedback structure with a lossless feedforward mapping.


(a) Establish the energy conservation relation $\|y\|^2 + \|e\|^2 = \|x\|^2 + \|r\|^2$, and conclude that
$\|e\| \le \|x\| + \|r\|$. Here $\|\cdot\|$ denotes the Euclidean norm of its vector argument.


(b) Let $\|\mathcal{F}\|$ denote the maximum singular value of $\mathcal{F}$. Use the triangle inequality of norms to
show that $\|r\| \le \|v\| + \|\mathcal{F}\| \cdot \|e\|$.


(c) Conclude that if $\mathcal{F}$ satisfies the small gain condition $\|\mathcal{F}\| < 1$ (in which case we say that $\mathcal{F}$
is contractive), then

$$\|e\| \le \frac{1}{1 - \|\mathcal{F}\|}\left[\|x\| + \|v\|\right]$$

Remark. This result indicates that a contractive feedback mapping $\mathcal{F}$ guarantees a certain robustness
performance from the disturbances $\{x, v\}$ to the output $\{e\}$. In this case, we say that the mapping from
$\{x, v\}$ to $\{e\}$ is $l_2$-stable.


(d) Plot $1/(1 - \|\mathcal{F}\|)$ for $0 < \|\mathcal{F}\| < 1$. Conclude that the smaller the value of $\|\mathcal{F}\|$:


(d.1) The smaller the effect of $\{x, v\}$ on $\{e\}$.
(d.2) The smaller the upper bound on $\|e\|$.

Problem XI.10 ($l_2$-stability of adaptive filters) Consider adaptive filters of the LMS-type with
time-dependent step-sizes,

$$w_i = w_{i-1} + \mu(i)\,u_i^*\,[d(i) - u_i w_{i-1}]$$

and assume the data $\{d(i), u_i\}$ satisfy $d(i) = u_i w^o + v(i)$ for some unknown vector $w^o$ and
disturbance $v(i)$. Define the weight-error vector and the a priori and a posteriori estimation errors
$\tilde{w}_i = w^o - w_i$, $e_a(i) = u_i\tilde{w}_{i-1}$ and $e_p(i) = u_i\tilde{w}_i$. Let also
$\bar{\mu}(i) = (\|u_i\|^2)^{\dagger}$.


(a) Show that $e_p(i) = \bigl(1 - \mu(i)\|u_i\|^2\bigr)\,e_a(i) - \mu(i)\|u_i\|^2\,v(i)$.
(b) Establish the energy conservation relation

$$\|\tilde{w}_i\|^2 + \bar{\mu}(i)\,|e_a(i)|^2 = \|\tilde{w}_{i-1}\|^2 + \bar{\mu}(i)\,|e_p(i)|^2$$

Conclude that the mapping from

$$\Bigl\{\tilde{w}_{-1},\ \sqrt{\bar{\mu}(0)}\,e_p(0),\ \sqrt{\bar{\mu}(1)}\,e_p(1),\ \ldots,\ \sqrt{\bar{\mu}(N)}\,e_p(N)\Bigr\}$$

to

$$\Bigl\{\tilde{w}_{N},\ \sqrt{\bar{\mu}(0)}\,e_a(0),\ \sqrt{\bar{\mu}(1)}\,e_a(1),\ \ldots,\ \sqrt{\bar{\mu}(N)}\,e_a(N)\Bigr\}$$

is lossless.
(c) Assume $u_i \neq 0$. Define

$$\eta(N) = \max_{0\le i\le N}\bigl|1 - \mu(i)\|u_i\|^2\bigr| \qquad \text{and} \qquad \kappa(N) = \max_{0\le i\le N}\mu(i)\|u_i\|^2$$

Use the small gain condition of Prob. XI.9 to establish that if $\eta(N) < 1$, i.e., if $\mu(i)\|u_i\|^2 < 2$
for $i = 0, 1, \ldots, N$, then the following inequalities hold:

$$\sqrt{\sum_{i=0}^{N}\bar{\mu}(i)\,|e_a(i)|^2}\ \le\ \frac{1}{1-\eta(N)}\left[\|\tilde{w}_{-1}\| + \kappa^{1/2}(N)\sqrt{\sum_{i=0}^{N}\mu(i)\,|v(i)|^2}\,\right]$$

$$\sqrt{\sum_{i=0}^{N}\mu(i)\,|e_a(i)|^2}\ \le\ \frac{\kappa^{1/2}(N)}{1-\eta(N)}\left[\|\tilde{w}_{-1}\| + \kappa^{1/2}(N)\sqrt{\sum_{i=0}^{N}\mu(i)\,|v(i)|^2}\,\right]$$

How would these results change if $u_i = 0$ for some $i$, say at $i = i_o$?


(d) Assume $\eta(N) < c < 1$ and $\kappa(N) < b < \infty$ for all $N$. That is, $\{\eta(N), \kappa(N)\}$ are
uniformly bounded as indicated. Assume further that $\|u_i\|^2 > \epsilon > 0$ for all $i$. If the normalized
noise sequence $\{\sqrt{\mu(i)}\,v(i)\}$ has finite energy, conclude that $e_a(i) \to 0$ as $i \to \infty$. If,
in addition, the regressors $\{u_i\}$ are persistently exciting, as defined in Prob. XI.8, conclude
that $w_i \to w^o$. Remark. Compared with Prob. XI.8, the convergence result is now obtained under
the condition of a finite-energy sequence $\{\sqrt{\mu(i)}\,v(i)\}$, which requires $\sqrt{\mu(i)}\,v(i) \to 0$ as
opposed to $v(i) \to 0$ (see Sayed and Rupp (1996) and Rupp and Sayed (1996a)).
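
The energy relation of part (b) and the first inequality of part (c) can be verified in simulation. The sketch below is illustrative only and rests on several assumptions that are not in the problem statement: real-valued data, Gaussian regressors and noise, step-sizes drawn so that 0.2 ≤ μ(i)‖u_i‖² ≤ 1.5, and the hypothetical function name `lms_energy_check`.

```python
import numpy as np

def lms_energy_check(M=5, N=200, seed=1):
    """Run LMS with time-dependent step-sizes and test the relations of Prob. XI.10."""
    rng = np.random.default_rng(seed)
    w_o = rng.standard_normal(M)           # unknown vector w^o
    w = np.zeros(M)                        # initial condition w_{-1}
    wt_init = np.linalg.norm(w_o - w)      # ||w~_{-1}||
    max_gap = 0.0                          # worst violation of the energy relation
    sum_ea = sum_v = 0.0                   # sums of mu_bar|e_a|^2 and mu|v|^2
    eta = kappa = 0.0
    for i in range(N + 1):
        u = rng.standard_normal(M)         # regressor u_i (real row vector)
        v = 0.1 * rng.standard_normal()    # disturbance v(i)
        d = u @ w_o + v
        nu2 = u @ u
        mu = rng.uniform(0.2, 1.5) / nu2   # keeps mu(i)*||u_i||^2 inside (0, 2)
        mu_bar = 1.0 / nu2
        wt_prev = w_o - w
        e_a = u @ wt_prev                  # a priori error
        w = w + mu * u * (d - u @ w)       # LMS-type update
        wt = w_o - w
        e_p = u @ wt                       # a posteriori error
        lhs = wt @ wt + mu_bar * e_a**2
        rhs = wt_prev @ wt_prev + mu_bar * e_p**2
        max_gap = max(max_gap, abs(lhs - rhs))
        sum_ea += mu_bar * e_a**2
        sum_v += mu * v**2
        eta = max(eta, abs(1.0 - mu * nu2))
        kappa = max(kappa, mu * nu2)
    bound = (wt_init + np.sqrt(kappa * sum_v)) / (1.0 - eta)
    return max_gap, np.sqrt(sum_ea), bound

max_gap, lhs_norm, bound = lms_energy_check()
```

The bound is typically loose, since it is a worst-case statement that holds for every disturbance sequence rather than for a typical one.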




Problem XI.11 (Energy propagation in feedback) Consider the same setting as in Prob. XI.10.


(a) Explain that the results of parts (a) and (b) of that problem can be represented in diagram form
as shown in Fig. XI.3, where the lossless mapping from $\{\tilde{w}_{i-1}, \sqrt{\bar{\mu}(i)}\,e_p(i)\}$ to
$\{\tilde{w}_i, \sqrt{\bar{\mu}(i)}\,e_a(i)\}$ is indicated by $\mathcal{T}$.


(b) Assume $v(i) = 0$. From parts (a) and (b) of Prob. XI.10 show that

$$\|\tilde{w}_i\|^2 = \|\tilde{w}_{i-1}\|^2 - \mu(i)\,\bigl[\,2 - \mu(i)\|u_i\|^2\,\bigr]\,|e_a(i)|^2$$

Conclude that $\|\tilde{w}_i\|^2 < \|\tilde{w}_{i-1}\|^2$ if, and only if, $0 < \mu(i)\|u_i\|^2 < 2$.
(c) Show that choosing $\mu(i)$ such that $\mu(i)\|u_i\|^2 = 1$ results in the largest energy decrease in
going from $\|\tilde{w}_{i-1}\|^2$ to $\|\tilde{w}_i\|^2$.


(d) Which algorithm corresponds to the choice $\mu(i)\|u_i\|^2 = 1$?


Remark. Observe that for the choice of step-size in part (c), the feedback path in Fig. XI.3 is disconnected, so
that there is no energy flowing back from the lower output of the lossless map into its lower input.


[Figure XI.3: block diagram. Signals shown: $\sqrt{\bar{\mu}(i)}\,v(i)$, $\tilde{w}_{i-1}$, $\sqrt{\bar{\mu}(i)}\,e_a(i)$. Gains shown: $\mu(i)\|u_i\|^2$ and $1 - \mu(i)\|u_i\|^2$.]

FIGURE XI.3 A feedback structure representing each iteration of an LMS-type filter with time-dependent
step-sizes.
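
A quick numerical look at how the weight-error energy after one noise-free update depends on the product c = μ(i)‖u_i‖² may help build intuition for parts (b) and (c) of Prob. XI.11. This is a sketch under assumed conditions (real data, one random regressor and one random prior weight error); the names `energy_after`, `grid`, and `best_c` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
u = rng.standard_normal(6)            # regressor u_i
wt_prev = rng.standard_normal(6)      # weight-error vector w~_{i-1}
e_a = u @ wt_prev                     # a priori error with v(i) = 0
nu2 = u @ u                           # ||u_i||^2

def energy_after(c):
    """||w~_i||^2 after one noise-free update with mu(i) * ||u_i||^2 = c."""
    mu = c / nu2
    wt = wt_prev - mu * u * e_a       # w~_i = w~_{i-1} - mu(i) u_i^T e(i)
    return wt @ wt

# Sweep c over a grid that extends beyond the interval (0, 2).
grid = np.linspace(0.0, 2.5, 251)
energies = np.array([energy_after(c) for c in grid])
best_c = grid[np.argmin(energies)]    # location of the largest energy decrease
```

On this grid the energy falls below its starting value only for 0 < c < 2, and the minimizer sits at c = 1.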


Problem XI.12 (Filtered-error adaptive filter) In applications such as active noise control and
vibration control, adaptive filters that rely on filtered errors become necessary, i.e., adaptive updates
of the form

$$w_i = w_{i-1} + \mu(i)\,u_i^*\,F[d(i) - u_i w_{i-1}]$$

where the notation $F[\cdot]$ denotes some filter that operates on the error signal $d(i) - u_i w_{i-1}$. In these
applications, filtered versions of the error are more readily available than the error signal itself, so
that the adaptive filter update would need to rely on $F[d(i) - u_i w_{i-1}]$ rather than on $d(i) - u_i w_{i-1}$.
For our purposes here, we shall assume that $F[\cdot]$ is an FIR filter of order $M_F$ with transfer function

$$F(z) = \sum_{j=0}^{M_F-1} f(j)\,z^{-j}$$


(a) Let $d(i) = u_i w^o + v(i)$. Introduce the estimation errors $e_a(i) = u_i\tilde{w}_{i-1}$ and
$e_p(i) = u_i\tilde{w}_i$, where $\tilde{w}_i = w^o - w_i$. Show that

$$e_p(i) = \bigl(e_a(i) - \mu(i)\|u_i\|^2\,F[e_a(i)]\bigr) - \mu(i)\|u_i\|^2\,F[v(i)]$$

and

$$\|\tilde{w}_i\|^2 + \bar{\mu}(i)\,|e_a(i)|^2 = \|\tilde{w}_{i-1}\|^2 + \bar{\mu}(i)\,|e_p(i)|^2$$

where $\bar{\mu}(i) = 1/\|u_i\|^2$ if $u_i \neq 0$ and $\bar{\mu}(i) = 0$ otherwise. Remark. Of course, this is the
same energy-conservation relation that we derived earlier in Thm. 15.1 for general functions $g(\cdot)$ of
$d(i) - u_i w_{i-1}$.


(b) Assume $u_i \neq 0$, $v(i) = 0$, and choose $\mu(i) = \alpha/\|u_i\|^2$ for some $\alpha > 0$. Show that the
mapping from $\{\sqrt{\bar{\mu}(0)}\,e_a(0), \ldots, \sqrt{\bar{\mu}(N)}\,e_a(N)\}$ to
$\{\sqrt{\bar{\mu}(0)}\,e_p(0), \ldots, \sqrt{\bar{\mu}(N)}\,e_p(N)\}$ is
described by the $(N+1) \times (N+1)$ lower triangular matrix

$$\mathcal{F}_N = \begin{bmatrix}
1 - \alpha f(0) & & & \\
-\alpha f(1)\sqrt{\dfrac{\bar{\mu}(1)}{\bar{\mu}(0)}} & 1 - \alpha f(0) & & \\
-\alpha f(2)\sqrt{\dfrac{\bar{\mu}(2)}{\bar{\mu}(0)}} & -\alpha f(1)\sqrt{\dfrac{\bar{\mu}(2)}{\bar{\mu}(1)}} & 1 - \alpha f(0) & \\
 & \ddots & \ddots & \ddots
\end{bmatrix}$$

in terms of the coefficients $\{f(j)\}$ of the filter $F(z)$. Since usually $N + 1 > M_F$, argue that
$\mathcal{F}_N$ is banded with only $M_F$ diagonals.

(c) Define the filtered-noise sequence $\bar{v}(i) = F[v(i)]$ and the coefficients $\eta(N) = \|\mathcal{F}_N\|$ and
$\kappa(N) = \max_{0\le i\le N} \mu(i)\|u_i\|^2$. That is, $\eta(N)$ is the largest singular value of $\mathcal{F}_N$.
If $\eta(N) < 1$, show that relations similar to those in part (c) of Prob. XI.10 hold with $v(i)$ replaced by
$\bar{v}(i)$.

(d) Usually, the length $M_F$ is much smaller than the length of the regression vector $u_i$. Thus
assume that the energy of the regression vectors does not change appreciably over $M_F$ successive
iterations, so that $\bar{\mu}(i) \approx \bar{\mu}(i-1) \approx \ldots \approx \bar{\mu}(i-M_F)$. Verify that, in this
case, $\mathcal{F}_N$ can be approximated by $\mathcal{F}_N \approx I - \alpha\,\mathcal{C}_N$, where $\mathcal{C}_N$ is the
$(N+1) \times (N+1)$ convolution matrix associated with the filter $F(z)$, i.e., $\mathcal{C}_N$ is a banded
Toeplitz matrix, e.g., for $M_F = 3$,

$$\mathcal{C}_N = \begin{bmatrix}
f(0) & & & \\
f(1) & f(0) & & \\
f(2) & f(1) & f(0) & \\
 & f(2) & f(1) & f(0) \\
 & & \ddots & \ddots
\end{bmatrix}$$

Argue that the contractivity requirement $\|\mathcal{F}_N\| < 1$ can be met if the step-size parameter $\alpha$
satisfies $\max_{\omega} |1 - \alpha F(e^{j\omega})| < 1$, where $F(e^{j\omega})$ denotes the frequency response of $F(z)$.


(e) Assume $v(i) = 0$ and let $e = \{\sqrt{\bar{\mu}(0)}\,e_a(0), \sqrt{\bar{\mu}(1)}\,e_a(1), \ldots, \sqrt{\bar{\mu}(N)}\,e_a(N)\}$
which, as seen from part (b), is the input to the mapping $\mathcal{F}_N$. Under the same conditions of part (d) show
that $\|\tilde{w}_N\|^2 \le \|\tilde{w}_{-1}\|^2 + \|I - \alpha\,\mathcal{C}_N\|^2\,\|e\|^2 - \|e\|^2$. Argue that an
approximate choice for $\alpha$ that maximizes the decrease in energy from $\|\tilde{w}_{-1}\|^2$ to $\|\tilde{w}_N\|^2$
can be obtained by solving $\min_{\alpha}\max_{\omega} |1 - \alpha F(e^{j\omega})|$ over all values of $\alpha$ that
guarantee $\max_{\omega} |1 - \alpha F(e^{j\omega})| < 1$.


Remark. For more details see Rupp and Sayed (1996a).
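
The link in part (d) between the matrix norm of I − αC_N and the frequency-domain quantity max over ω of |1 − αF(e^{jω})| can be explored numerically. The sketch below is an illustration, not a proof. It borrows the three-tap filter F(z) = 1 − 1.2z⁻¹ + 0.72z⁻² that appears in Project XI.1; the value α = 0.1, the matrix size 200, the frequency grid, and the helper names `convolution_matrix` and `peak_gain` are all assumptions made here.

```python
import numpy as np

def convolution_matrix(f, n):
    """Lower-triangular banded Toeplitz matrix C_N (n x n) built from the taps f."""
    C = np.zeros((n, n))
    for j, fj in enumerate(f):
        C += fj * np.eye(n, k=-j)     # tap f(j) fills the j-th subdiagonal
    return C

def peak_gain(f, alpha, n_freq=4096):
    """Maximum of |1 - alpha * F(e^{jw})| over a uniform frequency grid."""
    w = np.linspace(0.0, 2.0 * np.pi, n_freq, endpoint=False)
    F = sum(fj * np.exp(-1j * w * j) for j, fj in enumerate(f))
    return np.max(np.abs(1.0 - alpha * F))

f = [1.0, -1.2, 0.72]                 # taps of F(z), taken from Project XI.1
alpha = 0.1                           # assumed step-size parameter
n = 200                               # plays the role of N + 1
C = convolution_matrix(f, n)
spectral_norm = np.linalg.norm(np.eye(n) - alpha * C, 2)   # largest singular value
freq_bound = peak_gain(f, alpha)
```

For this filter the frequency-domain quantity stays below one at α = 0.1 and exceeds one at α = 0.3, so the contractivity test is sensitive to the step-size parameter.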


Problem XI.13 (Transfer function analysis) Consider the LMS recursion

$$w_i = w_{i-1} + \mu\,u_i^*\,e(i), \qquad e(i) = d(i) - u_i w_{i-1}$$

and assume $u_i$ has shift structure, i.e., $u_i = [\,u(i)\ \ u(i-1)\ \ \ldots\ \ u(i-M+1)\,]$. Assume
further that $u(i)$ is sinusoidal, say $u(i) = C\cos(\omega_o i)$ for some $\{C, \omega_o\}$. Let $d(i) = u_i w^o + v(i)$
and define the weight-error vector $\tilde{w}_i = w^o - w_i$.
(a) Verify that the update for the $k$-th entry of the weight-error vector is given by

$$[\tilde{w}_i]_k = [\tilde{w}_{i-1}]_k - \frac{\mu C}{2}\left[e^{j\omega_o(i-k)} + e^{-j\omega_o(i-k)}\right] e(i)$$

Let $W_k(z)$ denote the $z$-transform of the sequence $\{[\tilde{w}_i]_k\}$. Ignoring initial conditions,
verify that

$$W_k(z) = -\frac{\mu C}{2}\,\frac{z}{z-1}\left[E(ze^{-j\omega_o})\,e^{-jk\omega_o} + E(ze^{j\omega_o})\,e^{jk\omega_o}\right]$$


(b) Likewise, using $e_a(i) = u_i\tilde{w}_{i-1}$ show that

$$E_a(z) = \frac{C z^{-1}}{2}\sum_{k=0}^{M-1}\left[W_k(ze^{-j\omega_o})\,e^{-j\omega_o(k-1)} + W_k(ze^{j\omega_o})\,e^{j\omega_o(k-1)}\right]$$


(c) Substitute the expression for $W_k(z)$ into $E_a(z)$ and ignore the effects of the mixing terms
$E(ze^{2j\omega_o})$ and $E(ze^{-2j\omega_o})$ to show that

$$E_a(z) \approx \frac{\mu C^2 M}{2}\cdot\frac{1 - z\cos(\omega_o)}{z^2 - 2z\cos(\omega_o) + 1}\,E(z)$$

Using $E(z) = E_a(z) + V(z)$, conclude that the transfer function from $v(\cdot)$ to $e_a(\cdot)$ is
approximately given by

$$\frac{E_a(z)}{V(z)} \approx \frac{\dfrac{\mu}{\bar{\mu}}\,\bigl(1 - z\cos(\omega_o)\bigr)}{z^2 - 2z\cos(\omega_o)\left(1 - \dfrac{\mu}{2\bar{\mu}}\right) + \left(1 - \dfrac{\mu}{\bar{\mu}}\right)}$$

where $\bar{\mu} = 2/(C^2 M)$.


(d) Define $\bar{V}(z) = \frac{\mu}{\bar{\mu}}\,V(z) - \left(1 - \frac{\mu}{\bar{\mu}}\right)E_a(z)$. Show that
$E_a(z)/\bar{V}(z) = (z^{-1} - \cos(\omega_o))/(z - \cos(\omega_o))$. Conclude that the transfer function from
$\bar{v}(\cdot)$ to $e_a(\cdot)$ is all-pass. Conclude further that the transfer function from $v$ to $e_a$ can be
represented as a feedback structure, as shown in Fig. XI.4, with an all-pass filter in the forward path and a
constant gain in the feedback loop.


FIGURE XI.4 A transfer function description for LMS in terms of an all-pass forward mapping and a feedback
gain.


Remark. There are some analogies between the transfer function representation of Fig. XI.4 and the feedback
structure of Fig. 15.5 derived earlier using energy conservation arguments. However, the arguments
that led to Fig. 15.5 did not use any approximation and were carried out in the time domain. The transfer 
function derivation, on the other hand, assumes regressors with shift structure, uses a sinusoidal input 
sequence (although some other classes of signals can be used), and ignores the effect of initial conditions 
and nonlinear mixing terms. Still, the transfer function approach offers useful insights into the perfor- 
mance of LMS-type algorithms. It was first described by Glover (1977) and later extended by Clarkson 
and White (1987). 
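
The algebra behind part (d) of Prob. XI.13 can be sanity-checked on the unit circle. The sketch below evaluates the approximate closed-loop transfer function of part (c), the forward path (z⁻¹ − cos ω_o)/(z − cos ω_o), and the feedback interconnection implied by the definition in part (d). The values ω_o = 0.7 and μ/μ̄ = 0.2 are arbitrary illustrative choices, and the function names are hypothetical.

```python
import numpy as np

def closed_loop(z, w0, ratio):
    """Approximate E_a(z)/V(z) from part (c); ratio stands for mu / mu_bar."""
    c = np.cos(w0)
    den = z**2 - 2.0 * z * c * (1.0 - ratio / 2.0) + (1.0 - ratio)
    return ratio * (1.0 - z * c) / den

def all_pass(z, w0):
    """Forward path (z^{-1} - cos(w0)) / (z - cos(w0)) from part (d)."""
    c = np.cos(w0)
    return (1.0 / z - c) / (z - c)

w = np.linspace(0.05, np.pi - 0.05, 400)   # frequency grid on the unit circle
z = np.exp(1j * w)
w0, ratio = 0.7, 0.2                        # assumed sinusoid frequency and mu / mu_bar

A = all_pass(z, w0)
# Feedback interconnection: E_a = A * (ratio * V - (1 - ratio) * E_a).
H_feedback = ratio * A / (1.0 + (1.0 - ratio) * A)
H_direct = closed_loop(z, w0, ratio)
```

The two expressions agree on the grid and the forward path has unit magnitude at every frequency, which is the all-pass property the problem asks the reader to establish analytically.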


Problem XI.14 (Filtered-error LMS) Consider the filtered-error LMS recursion from Prob. XI.12,

$$w_i = w_{i-1} + \mu\,u_i^*\,F[e(i)], \qquad e(i) = d(i) - u_i w_{i-1}$$

Repeat the transfer function derivation of Prob. XI.13 and show that it leads to

$$\frac{E_a(z)}{V(z)} \approx \frac{\dfrac{\mu C^2 M}{2}\,F(z)\,\bigl(1 - z\cos(\omega_o)\bigr)}{z^2 - 2z\cos(\omega_o)\left(1 - \dfrac{\mu C^2 M}{4}F(z)\right) + \left(1 - \dfrac{\mu C^2 M}{2}F(z)\right)}$$

Conclude that the transfer function from $v(\cdot)$ to $e_a(\cdot)$ can again be described in terms of an all-pass
feedforward path and a dynamic feedback loop given by $1 - \frac{\mu}{\bar{\mu}}F(z)$, where $\bar{\mu} = 2/(C^2 M)$.


Problem XI.15 (Two inequalities) Consider any column vectors $\{x, y\}$, and any positive-definite
matrix $P$.

(a) Show that for any $\alpha > 0$, $(x + y)^* P (x + y) \ge \left(1 - \frac{1}{\alpha}\right) x^* P x + (1 - \alpha)\, y^* P y$.

(b) Likewise, show that for any $\alpha > 0$, $(x + y)^* P (x + y) \le \left(1 + \frac{1}{\alpha}\right) x^* P x + (1 + \alpha)\, y^* P y$.
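
Before attempting the proofs, the two inequalities of Prob. XI.15 can be spot-checked on random data. This is only a numerical sanity check under assumed conditions: complex Gaussian vectors, a randomly generated Hermitian positive-definite P, and α drawn uniformly from [0.05, 20]. The names `quad`, `worst_lower`, and `worst_upper` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5
B = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
P = B.conj().T @ B + np.eye(n)        # Hermitian positive-definite matrix P

def quad(a):
    """Real-valued quadratic form a^* P a."""
    return np.real(a.conj() @ P @ a)

worst_lower = np.inf                  # smallest slack seen in inequality (a)
worst_upper = np.inf                  # smallest slack seen in inequality (b)
for _ in range(1000):
    x = rng.standard_normal(n) + 1j * rng.standard_normal(n)
    y = rng.standard_normal(n) + 1j * rng.standard_normal(n)
    alpha = rng.uniform(0.05, 20.0)
    middle = quad(x + y)
    lower = (1.0 - 1.0 / alpha) * quad(x) + (1.0 - alpha) * quad(y)
    upper = (1.0 + 1.0 / alpha) * quad(x) + (1.0 + alpha) * quad(y)
    worst_lower = min(worst_lower, middle - lower)
    worst_upper = min(worst_upper, upper - middle)
```

Both slack values stay nonnegative over all trials, as the inequalities require.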


Problem XI.16 (Two optimization problems) Establish the following conclusions.
(a) For any $\beta > 0$,

$$\min_{\alpha > 1}\left(\alpha + \frac{\alpha\beta}{\alpha - 1}\right) = \left(1 + \sqrt{\beta}\right)^2$$

and the minimum occurs at $\alpha = 1 + \sqrt{\beta}$.


(b) For any $\beta > 1$,

$$\max_{\alpha > 0}\left(\frac{\alpha\beta}{1 + \alpha} - \alpha\right) = \left(\sqrt{\beta} - 1\right)^2$$

and the maximum occurs at $\alpha = \sqrt{\beta} - 1$.


Problem XI.17 (RLS algorithm) Let $d(i) = u_i w^o + v(i)$ and consider the RLS filter:

$$w_i = w_{i-1} + \frac{P_{i-1}u_i^*}{1 + u_i P_{i-1} u_i^*}\,[d(i) - u_i w_{i-1}], \qquad i = 0, 1, \ldots, N$$

$$P_i = P_{i-1} - \frac{P_{i-1}u_i^* u_i P_{i-1}}{1 + u_i P_{i-1} u_i^*}, \qquad P_{-1} = \Pi^{-1}$$

which we know minimizes, over $w$, the cost function $J(w) = w^*\Pi w + \sum_{i=0}^{N} |d(i) - u_i w|^2$. We
also know from Lemma 30.1 that the resulting minimum cost is given by either of the following
expressions:

$$\xi(N) = \sum_{i=0}^{N}\frac{|e(i)|^2}{1 + u_i P_{i-1} u_i^*} = \sum_{i=0}^{N} |r(i)|^2\,\bigl(1 + u_i P_{i-1} u_i^*\bigr)$$

where $e(i) = d(i) - u_i w_{i-1}$ and $r(i) = d(i) - u_i w_i$. Let $\tilde{w}_i = w^o - w_i$. Then it also holds that
$e(i) = e_a(i) + v(i)$ and $r(i) = e_p(i) + v(i)$.
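(a) Since $J(w) \ge \xi(N)$ for any $w$ and $\{v(\cdot)\}$, conclude that the following inequality holds:

$$w^{o*}\Pi w^o + \sum_{i=0}^{N} |v(i)|^2\ \ge\ \left(\min_{0\le i\le N}\frac{1}{1 + u_i P_{i-1} u_i^*}\right)\left(\sum_{i=0}^{N} |e_a(i) + v(i)|^2\right)$$


(b) Define $\bar{r} = \max_{0\le i\le N}\,(1 + u_i P_{i-1} u_i^*)$. Use the result of part (a) from Prob. XI.15 to conclude
that, for any $\alpha > 1$,

$$\sum_{i=0}^{N} |e_a(i)|^2\ \le\ \frac{\bar{r}\,\alpha}{\alpha - 1}\,w^{o*}\Pi w^o + \left(\alpha + \frac{\bar{r}\,\alpha}{\alpha - 1}\right)\sum_{i=0}^{N} |v(i)|^2$$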


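(c) Use the results of Prob. XI.16 to conclude that

$$\frac{\sum_{i=0}^{N} |e_a(i)|^2}{w^{o*}\Pi w^o + \sum_{i=0}^{N} |v(i)|^2}\ \le\ \left(1 + \sqrt{\bar{r}}\right)^2 \qquad \text{and} \qquad \frac{\sum_{i=0}^{N} |e_p(i)|^2}{w^{o*}\Pi w^o + \sum_{i=0}^{N} |v(i)|^2}\ \le\ \left(1 + \sqrt{1/\underline{r}}\right)^2$$

where $\underline{r} = \min_{0\le i\le N}\,(1 + u_i P_{i-1} u_i^*)$, and conclude that

$$\frac{\sum_{i=0}^{N} |e_p(i)|^2}{w^{o*}\Pi w^o + \sum_{i=0}^{N} |v(i)|^2}\ \le\ 4$$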


Remark. This result agrees with (46.26). The energy argument that led to (46.26) provides a tighter 
bound. For more details on the results of Probs. XI.15-XI.17 see Hassibi and Kailath (2001) and also the 
monograph by Hassibi, Sayed, and Kailath (1999). 
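
The two expressions for the minimum cost and the a priori bound with constant (1 + √r̄)² can be examined in a short simulation. The sketch below is a numerical illustration under assumptions that are not part of Prob. XI.17: real-valued Gaussian data, Π = 2I, the initial condition w₋₁ = 0, and the hypothetical function name `rls_bound_check`. The batch regularized least-squares solution is computed separately for comparison.

```python
import numpy as np

def rls_bound_check(M=4, N=300, seed=4):
    """Run RLS and compare its minimum-cost expressions and a priori energy ratio."""
    rng = np.random.default_rng(seed)
    w_o = rng.standard_normal(M)
    Pi = 2.0 * np.eye(M)                  # regularization matrix (assumed value)
    P = np.linalg.inv(Pi)                 # P_{-1} = Pi^{-1}
    w = np.zeros(M)                       # w_{-1} = 0 (assumed)
    U, d = [], []
    xi_e = xi_r = 0.0                     # the two minimum-cost expressions
    sum_ea = sum_v = 0.0
    r_bar = 0.0
    for i in range(N + 1):
        u = rng.standard_normal(M)
        v = 0.3 * rng.standard_normal()
        di = u @ w_o + v
        g = 1.0 + u @ P @ u               # 1 + u_i P_{i-1} u_i^*
        e = di - u @ w                    # a priori output error e(i)
        e_a = u @ (w_o - w)               # a priori estimation error e_a(i)
        w = w + (P @ u) * e / g           # weight update uses P_{i-1}
        P = P - np.outer(P @ u, u @ P) / g
        r = di - u @ w                    # a posteriori output error r(i)
        xi_e += e**2 / g
        xi_r += r**2 * g
        sum_ea += e_a**2
        sum_v += v**2
        r_bar = max(r_bar, g)
        U.append(u)
        d.append(di)
    U, d = np.array(U), np.array(d)
    # Batch minimizer of J(w) = w^T Pi w + sum_i |d(i) - u_i w|^2.
    w_ls = np.linalg.solve(Pi + U.T @ U, U.T @ d)
    J_min = w_ls @ Pi @ w_ls + np.sum((d - U @ w_ls) ** 2)
    ratio = sum_ea / (w_o @ Pi @ w_o + sum_v)
    return xi_e, xi_r, J_min, ratio, (1.0 + np.sqrt(r_bar)) ** 2

xi_e, xi_r, J_min, ratio, bound = rls_bound_check()
```

In runs of this kind the energy ratio is usually far below the bound, consistent with the remark that the bound is not tight.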
Project XI.1 (Active noise control) Consider the setting of Fig. XI.5. The noise from an engine,
usually in an enclosure such as a duct, travels towards the rightmost microphone location. In order 
to diminish the noise level at that location, the incoming noise wave is measured by the leftmost 
microphone and used to generate a noise-like signal by a loudspeaker. The purpose of this replica is 
to cancel the arriving noise signal at the rightmost microphone. 


[Figure XI.5: duct with a primary source, a secondary source, two microphones, and a processing block. Signals shown: $d(i)$ and the noise-like signal $\hat{d}(i)$.]

FIGURE XI.5 An active noise control system in a duct.


In an adaptive implementation of an active noise control system, the structure of Fig. XI.6 is 
used. The figure shows the input noise signal u(i) and a filtered version of it, denoted by d(i), which 
corresponds to u(i) travelling down the enclosure until it reaches the secondary source. 

In Fig. XI.6, an anti-noise (or noise-like) sequence, $\hat{d}(i)$, is generated by an adaptive FIR filter
of length $M$ at the secondary source with the intent of cancelling $d(i)$. The difference between both
signals $d(i)$ and $\hat{d}(i)$ cannot be measured directly but only a filtered version of it, which we shall
denote by $e_f(i) = F[d(i) - \hat{d}(i)]$. The filter $F$ is assumed to be FIR and its presence is due to the fact
that the signals $\{d(i), \hat{d}(i)\}$ have to further travel a path before reaching the rightmost microphone.
This path is usually unknown, and the objective is to update the adaptive filter weights in order to
reduce or cancel the filtered error $e_f(i)$.

One adaptive algorithm that has been developed for such purposes is the so-called filtered-x least- 
mean-squares (FxLMS) algorithm. The algorithm requires filtering the regression vector by the same 
filter F, namely, it relies on the update 


$$w_i = w_{i-1} + \mu(i)\,(F[u_i])^*\,F[e(i)], \qquad e(i) = d(i) - \hat{d}(i), \qquad \hat{d}(i) = u_i w_{i-1} \qquad \text{(FxLMS)}$$


Here, as usual, $u_i$ denotes the regressor of the tapped-delay-line adaptive filter. Moreover, the notation
$F[x]$ is used to denote the filter $F$ applied to the sequence $x$.

The FxLMS algorithm as such requires knowledge of F in order to process the regression data 
by it. In addition, it tends to exhibit poor convergence performance. In this project, we shall focus 
mostly on a filtered-error least-mean-squares variant (FeLMS), which does not require knowledge of 
F. Its update equation has the form: 

$$w_i = w_{i-1} + \mu(i)\,u_i^*\,F[e(i)], \qquad e(i) = d(i) - u_i w_{i-1} \qquad \text{(FeLMS)}$$


That is, it is an LMS update with $e(i)$ replaced by $e_f(i)$. This is the same algorithm studied in
Prob. XI.12. We shall assume that $F(z) = 1 - 1.2z^{-1} + 0.72z^{-2}$. Let $\mu(i) = \alpha/\|u_i\|^2$ for some
parameter $\alpha > 0$ and fix the filter order at $M = 10$.
[Figure XI.6: block diagram. Labels shown: secondary source, microphone, $e_f(i)$.]

FIGURE XI.6 Structure for adaptive active noise control.


(a) Consider the maximization $g(\alpha) = \max_{\omega} |1 - \alpha F(e^{j\omega})|$. Plot $g(\alpha)$. For what range
of $\alpha$ is $g(\alpha) < 1$? Determine the value of $\alpha$ that minimizes $g(\alpha)$.


(b) Load the file path.mat; it contains 10 samples of the signal path impulse response from $u(i)$
to $d(i)$. Assume that an additive zero-mean white Gaussian noise $v(i)$ corrupts the signal
$d(i)$ and set its variance level at 60 dB below the input signal power. Train the filtered-error
LMS filter and generate ensemble-average mean-square deviation curves for the choices
$\alpha \in \{0.01, 0.02, 0.03, 0.05, 0.1\}$. Observe how the steady-state values of $E\,\|\tilde{w}_i\|^2$ vary
with $\alpha$. Use binary data $\pm 1$ with equal probability as input data $\{u(i)\}$. Run each experiment
for 30000 iterations and average the results over 50 experiments.


(c) Train again filtered-error LMS and generate ensemble-average learning curves for the choices
$\alpha \in \{0.05, 0.1, 0.2, 0.5, 0.6\}$. Which value of $\alpha$ results in fastest convergence of the learning
curve? Run each experiment for 5000 iterations and average the results over 50 experiments.
Generate learning curves for both cases:


1. input $u(i)$ is $\pm 1$ with equal probability.
2. input $u(i)$ is white Gaussian with unit variance.


(d) When the variations in the weight estimates are slow over the length of the filter $F$, say
$w_i \approx w_{i-1} \approx \ldots \approx w_{i-M_F}$, we can approximate $F[u_i w_{i-1}]$ by $F[u_i]w_{i-1}$, so that the
FxLMS update can be approximated by

$$w_i = w_{i-1} + \mu(i)\,(F[u_i])^*\,\bigl(F[d(i)] - F[u_i]\,w_{i-1}\bigr)$$

This update now has the same form as a standard LMS update and, correspondingly, similar
performance, with the signals $\{u_i, d(i)\}$ replaced by their filtered versions $\{F[u_i], F[d(i)]\}$.
A modification to FxLMS that also makes its update become similar to that of an LMS update,
without the need for the slow adaptation assumption, can be obtained by adding two terms to
the update equation as follows:

$$w_i = w_{i-1} + \mu(i)\,(F[u_i])^*\,\bigl(F[e(i)] + F[u_i w_{i-1}] - F[u_i]\,w_{i-1}\bigr) \qquad \text{(mFxLMS)}$$

This variation shows improved convergence performance over standard FxLMS.

Set $\alpha = 0.15$ and generate learning curves for the following four algorithms: a) NLMS,
b) filtered-error LMS (FeLMS), c) filtered-x LMS (FxLMS), and d) modified filtered-x LMS
(mFxLMS). Run each experiment for 5000 iterations and average the results over 50 experiments.
Use white Gaussian input data with unit variance.
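
As a possible starting point for the filtered-error LMS experiments, here is a minimal single-run sketch of the FeLMS recursion with μ(i) = α/‖u_i‖², M = 10, binary ±1 input, and the filter F(z) = 1 − 1.2z⁻¹ + 0.72z⁻². Since path.mat is not reproduced here, the signal path is replaced by a randomly generated 10-tap vector; that substitution, the run length, and the function name `felms_msd` are assumptions made for illustration. Ensemble averaging over 50 runs would still have to be added.

```python
import numpy as np

def felms_msd(alpha, n_iter=3000, M=10, seed=0):
    """Single FeLMS run; returns the squared weight-error norm at each iteration."""
    rng = np.random.default_rng(seed)
    f = np.array([1.0, -1.2, 0.72])             # taps of F(z) from the project text
    w_o = rng.standard_normal(M) / np.sqrt(M)   # hypothetical stand-in for path.mat
    sigma_v = 1e-3                              # noise std: variance 60 dB below unit power
    w = np.zeros(M)
    u_line = np.zeros(M)                        # tapped-delay-line regressor u_i
    e_hist = np.zeros(len(f))                   # holds e(i), e(i-1), e(i-2)
    msd = np.zeros(n_iter)
    for i in range(n_iter):
        u_line = np.roll(u_line, 1)
        u_line[0] = rng.choice([-1.0, 1.0])     # binary input with equal probability
        d = u_line @ w_o + sigma_v * rng.standard_normal()
        e_hist = np.roll(e_hist, 1)
        e_hist[0] = d - u_line @ w              # unfiltered error e(i)
        e_f = f @ e_hist                        # filtered error F[e(i)]
        mu = alpha / (u_line @ u_line)          # normalized step-size alpha / ||u_i||^2
        w = w + mu * u_line * e_f               # FeLMS update (real data)
        msd[i] = np.sum((w_o - w) ** 2)
    return msd

msd = felms_msd(0.1)
```

Averaging `felms_msd` over several seeds and plotting the result in dB against the iteration index gives curves comparable to those requested in part (b).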
RCA, 183, 217
RLS, 163, 198, 262, 287, 311, 406, 493 
Sato, 207 
sign-sign LMS, 206 
sign-error LMS, 183, 206, 216, 257, 285 
sign regressor LMS, 206, 268, 398 
square-root RLS, 583 
Steepest-descent, 142 
stochastic gradient, 163 
stop-and-go, 207, 226
almost-sure stability, 387 
analysis filter bank, 449 
angle preservation, 568 
angle-normalized error, 677 
APA, see affine projection algorithm 
array algorithm, 564, 594, 599 
array methods 

information form, 592 
array processing, 114 
auto-correlation, see random process 
auto-regressive model, 304 
averaging analysis, 300, 363 


backpropagation algorithm, 744 
backward consistency, 640 
backward prediction, 622 
backward projection, 518, 658 
backward stability, 594 
backward a posteriori error, see error 
backward a priori error, see error 
band partitioning, 413, 427 
baseband signal, 101 

basis rotation, 565, 594, 595, 608 
batch processing, 697 

Bayes' rule, 4, 537 

Bayesian approach, 52 
beamforming, 101, 135 

Bessel function, 325, 558 

bias, 113, 190, 322 

blind algorithm, 183 

block adaptive filter, 413, 426, 466 
block inverse QR, 596 

block LMS, see algorithm 

block processing, 127, 426 

block QR, 596 

block RLS, 551-553 
Bryson-Frazier algorithm, 511 
Bussgang's theorem, 309 


calculus of variations, 112 

carrier frequency, 101, 326 

carrier frequency offset, 273, 317 
catastrophic cancellation, 562 
Cauchy-Riemann conditions, 26 
Cauchy-Schwartz inequality, 240, 737 
causality, 719, 727 

Cayley-Hamilton theorem, 345 
CDMA, 113 

CDMA2000, 113 


central solution, 722 
Chandrasekhar filter, 618, 641, 646 
channel estimation, 80, 91, 137 

adaptive, 168 

MIMO, 125 
characteristic polynomial, 344 
Chebyshev’s inequality, 2, 54 
Chi-square distribution, see random variable 
Cholesky factor, 563 
Cholesky factorization, 17, 542 
Cholesky method, 542 
circulant matrix, see matrix 
CMA, see algorithm 
column span, see matrix 
combining estimators, 118 
companion matrix, 346 
completion-of-squares, 63, 481, 706 
complex differentiation, 25 
complex sign, 212 
composite source signal, 324, 474 
congruence, see matrix 
constant-modulus criterion, 213, 222 
constellation 

BPSK, 31, 43, 135 

QAM, 42, 224 

QPSK, 5, 42, 116, 134, 136 
constrained LMS, 216, 311 
constrained least-squares, 545 
constrained optimization, 90, 130 
contour curve, 155, 209 


convergence analysis, see transient performance 


convergence in the mean, 343 
convergence time, 353 
conversion factor, 219, 496, 516, 526 
order-update, 642 
time-update, 612, 642 
convolution, 34, 431 
block, 431 
correlation coefficient, see random variables 
criterion 
constant-modulus, 213, 222, 223 
leaky, 211 
least-mean-squares, 30, 32, 61 
least-squares, 476 
LMF, 212 
LMMN, 213 
multi-modulus, 214 
reduced constellation, 214 
sign, 212 
critical point, 709 
critically-decimated filter bank, 450 
cumulative distribution function, 326 
cyclic nonstationarity, 273 
cyclic prefix, 113, 122 


data-reusing algorithm, see algorithm 
data fusion, 118, 553 
data nonlinearity, 234, 308, 330, 397 
DCT, 417 
DCT-domain LMS, 423 
decimator, 433, 472 
decision device, 134 

hard decision, 35 

soft decision, 35 
decision-directed operation, 174 


decorrelation property, 414 
degree of nonstationarity, 275 
delayless implementation, 470 
DFE, see equalization 
DFT, 123, 417 
DFT-domain LMS, 421 
DFT block adaptive filter, 440 
constrained implementation, 444, 461 


unconstrained implementation, 441, 460 


diffusion, 392 

digital communications, 273 
discrete cosine transform, see DCT 
discrete Fourier transform, see DFT 
discrete Hartley transform, see DHT 
distributed adaptive filters, 299, 392 
distributed processing, 119, 299, 392 
dithering, 211, 303, 640 

Doppler frequency, 326, 558 

double talk, 300 

downdating, 549 

drift, 190, 215, 303, 322 

dual-sign LMS, see algorithm 

duct, 755 


echo cancellation 
acoustic, 223, 300, 426, 467, 473 
composite source signal, 324, 474 
echo return loss, 325 
echo return loss enhancement, 325 
far end, 323 
line, 223, 300, 323, 467, 473 
near end, 323 
echo path, 324, 473 
eigen-decomposition, see matrix 
eigenfilter, 129 
eigenvalue spread, 150, 413 
embedding, 432 
EMSE, see excess mean-square error 


energy conservation relation, 228, 234, 236, 276, 735
algebraic derivation, 235 
geometric interpretation, 240 
physical interpretation, 241 
system-theoretic interpretation, 243 
weighted version, 330 
ensemble-average learning curve, 174 
envelope, 101 
equalization 
adaptive, 224 
adaptive linear, 171 
adaptive DFE, 172 
blind, 186, 214, 225, 311 
DFE, 93, 114, 130, 136 
fractionally-spaced, 114, 323 
frequency-domain, 123 
linear, 70, 82, 114, 117, 130, 134, 137 
MIMO, 123 
symbol-spaced, 114 
zero-forcing, 114 
equivalence, 501, 502, 538 
ergodicity, 263, 477 
error 
a posteriori, 233, 276 
a priori, 232, 276 
angle-normalized, 677 
backward a posteriori, 658
backward a priori, 673 
forward a posteriori, 661 
forward a priori, 673 
output, 230, 276 
weight, 232, 276 
weighted a posteriori, 331 
weighted a priori, 331 
error covariance matrix, 65, 109, 116 
error nonlinearity, 234 
error surface, 155 
estimate, 29 
estimator, 31 
affine, 38, 60, 61 
best linear unbiased, 89 
constrained, 87 
least-mean-squares, 32, 45, 50 
linear, 38, 61 
linear least-mean-squares, 66, 79, 142 
maximum a posteriori, 537 
maximum-likelihood, 537 
minimum variance unbiased, 89 
MMSE, 32, 45, 50, 89 
unbiased, 33, 76, 88 
Euclidean norm, 167 
excess mean-square error, 154, 230, 239, 274 
exponential distribution, see random variable 


fading, 302, 325 
fading channel, 302, 325 
FAEST, see fast a posteriori error sequential technique
far-end signal, 323 
fast array RLS, 614, 640 
fast Fourier transform, see FFT 
fast Kalman filter, 631 
fast transversal filter, 628, 640 
feedback stabilization, 635 
stabilized, 637, 650 
fast a posteriori error sequential technique, 630, 
640 
feedback filter, 94 
feedforward filter, 94 
FFT, 421, 436 
filter divergence, 539 
filtered estimate, 732 
filtered-error adaptive filter, 750, 753, 755 
filtered-x adaptive filter, 755 
finite alphabet, 304 
finite precision, 237, 302 
finite precision effects 
LMS, 302 
NLMS, 303 
finite word-length effects, 302 
finite-memory RLS, 549 
fixed-order filter, 610 
fixed-point arithmetic, 321 
forgetting factor, 498, 613 
forward prediction, 624 
forward projection, 525, 661 
forward a posteriori error, see error 
forward a priori error, see error 
fractionally-spaced equalizer, see equalization 
Frobenius norm, 541 
FTF, see fast transversal filter 


fullband filter, 127 


gain vector, 494 

order-update, 643 

time-update, 611, 643 
Gamma function, 55 
Gauss-Markov theorem, 52, 89, 545 
Gaussian random variable, see random variable 
Gaussian distribution, see random variable 
Givens rotation, 571 
gradient noise, 163 
gradient vector, 25, 63 

instantaneous approximation, 165 
Gram-Schmidt procedure, 19, 105, 196, 219 
group delay, 85 
growing memory, 498 


H∞ criterion, 744

H∞ filter, 732

hands-free telephony, 473 

Hessian matrix, 25, 63 

higher-order spectra, 128 

histogram, 258 

homogeneous difference equation, 215 
Householder transformation, 576, 605 
hybrid, 323 

hyperbolic rotation, 602 

hyperbolic transformation, 614 
hyperstability, 745 


idempotent matrix, 484 
impulse response, 117 
impulsive noise, 703 
indefinite least-squares, 705 
inertia conditions, 717 
recursive formulation, 710 
singular weighting matrix, 714 
independence assumptions, 247, 336, 363 
inertia, see matrix 
inner product, 568 
instantaneous approximation, 165 
instrumental variable method, 128 
inter-symbol interference, 70, 82, 93, 137 
interference cancellation, 221 
interpolator, 434, 472 
inverse QR method, 581, 598 
involutary matrix, 576, 605 
iterative reweighted least-squares, 536 


joint process estimation, 656 
Jordan canonical form, see matrix 


Kalman filter, 104—110, 112, 501, 502, 539 
array covariance form, 591, 598 
array information form, 592, 598 
array methods, 589-592 
Chandrasekhar recursions, 618 
covariance form, 109, 590, 618 
information form, 592 
innovations process, 104 
measurement-update form, 110, 590 
Riccati recursion, 109 
square-root information filter, 592 
time-update form, 110, 590 

Karhunen-Loève transform, 419


Kronecker product, 23, 358, 389 


l1-norm, 211
l2-stability, 749 
Lagrange multiplier, 509 
Laguerre adaptive filter, 650 
lattice filter 
a posteriori, 672, 696 
a posteriori error feedback, 680
a priori, 673, 696
a priori error feedback, 676
array, 688, 696 
connection to Kalman filtering, 702, 703 
finite-precision effects, 703 
Givens based, 696, 699 
least-mean-squares, 125 
normalized, 682 
QR based, 688, 696 
leaky-LMS, see algorithm 
learning curve, 154, 161, 174, 387, 390 
LMS, 363 
NLMS, 374 
least-mean-squares 
estimate, 30 
estimation, 29 
estimator, 32, 42, 45, 50 
weighted, 116 
least-perturbation property, 167, 182, 193, 219 
least-squares criterion, 476 
least-squares problem 
Cholesky method, 542 
constrained, 545 
indefinite, 705 
minimum-norm solution, 22, 541 
normal equations, 479 
order-update, 515 
orthogonality condition, 479, 490 
over-determined, 478 
projection, 479 
QR method, 538, 542 
regularized, 487 
robust, 546 
under-determined, 478 
weighted, 485 
least-squares solution 
completion-of-squares argument, 481 
differentiation argument, 480 
geometric argument, 478 
minimum-norm, 483, 541 
Levenberg-Marquardt method, 162 
Levinson-Durbin algorithm, 125, 127 
likelihood function, 537 
line echo cancellation, see echo cancellation 
line enhancement, 221 
linear equations 
consistent, 75 
under-determined, 168 
linear model, 78, 90, 231 
Lipschitz condition, 366 
LMS algorithm, see algorithm 
log-likelihood function, 537 
lossless mapping, 243, 314, 749 
loudspeaker, 473, 755 
Lyapunov stability, 300 


machine precision, 321 


manifold, 194 
Markov’s inequality, 54 
matrix 
Cholesky factor, 563 
circulant, 122, 128, 430, 432, 547 
column span, 14 
companion, 346 
congruence, 16 
determinant, 115 
eigen-decomposition, 12 
full rank, 15 
Hermitian, 12 
idempotent, 484 
ill-conditioned, 542 
inertia, 16 
involutary, 576, 605 
Jordan canonical form, 115 
nonnegative-definite, 6, 12 
nullspace, 14, 74 
positive-definite, 12, 17 
pseudo-circulant, 128 
pseudo-inverse, 22, 483 
QR decomposition, 19 
range space, 14, 74 
rank, 15, 74, 541 
rank deficient, 15 
rank-one, 115 
Schur complement, 16, 48, 64 
singular value, 20 
singular value decomposition, 20, 484 
singular vector, 20 
spectral decomposition, 12 
square-root, 562 
stable, 145, 347 
structured, 466 
Toeplitz, 122, 127, 128 
trace, 44, 115 
unitary, 243, 417 
matrix inversion formula, 78 
maximal ratio combining, 86, 121 
maximally-decimated filter bank, 450 
maximum a posteriori estimation, 537 
mean, see random variable 
mean estimation, 90, 113 
mean-square convergence, 343 
mean-square deviation, 350 
data-normalized filter, 375 
LMS, 353 
NLMS, 373 
mean-square performance 
APA, 268, 307 
CMA, 268, 311-313 
constrained LMS, 311 
data-normalized filter, 375 
independence assumptions, 247 
leaky-LMS, 266, 401 
LMF, 267, 306 
LMMN, 267, 306 
LMS, 244, 350, 389 
NLMS, 252, 373 
NLMS with power normalization, 267, 306 
RLS, 262, 311, 406 
separation principle, 245, 252, 307 
sign-error LMS, 257 
sign regressor LMS, 268 
small step-size assumption, 245, 392, 394
white Gaussian input, 246 
mean-square stability, 346 
mean-square-error, 30, 61 
microphone, 473 
MIMO RLS, 647 
MIMO adaptive filter, 647 
MIMO channel, 123, 131 
min-max optimization, 149 
minimum-norm solution, 22, 483, 541 
minimum-phase system, 415 
misadjustment, 230, 239, 283 
mismatch, 474 
mixing condition, 364 
MMSE combiner, 122 
MMSE equalizer, see equalization 
mode, 146 
MRC, see maximal ratio combining 
MSD, see mean-square deviation 
MSE, see mean-square error 
multi-antenna system, 123, 131 
multi-carrier modulation, 538 
multi-delay filter, 445 
multi-modulus criterion, 214 
multichannel algorithm, 647 
multichannel LMS, 647 
multichannel RLS, 647 
multipath channel, 326, 558 
multirate DSP, 128 
multistage detection, 133 


N-dependent regressors, 363 
near-end signal, 323 
negative-definite matrix, see matrix 
Newton’s method 

algorithm, 160, 211 

learning curve, 161 
NLMS algorithm, see algorithm 
noise 

colored, 135 

estimation, 91, 137 

Gaussian, 290 

mixed, 292 

uniform, 290 

white, 7, 81 
noise cancellation, 221 
non-blind algorithm, 183 
noncritically-sampled filter bank, 450 
nonstationary environment, 232, 271 
norm, 22 
norm preservation, 566 
normal equations, 61, 73, 141, 479 
notch filter, 220 
nullspace, see matrix 
numerical stability, 538, 599, 633, 650 


oblique projection, 128 

ODE method, 384 

OFDM, 113, 123, 538, 547, 554 
optics, 241 

order-recursive filter, 653 
order-recursive solution, 125, 127 
order-update, 515 

orthogonal complement space, 484, 541 
orthogonal vectors, 14 


orthogonality principle, 36, 72 
output error, see error 
overlap-add structure, 439, 455 
overlap-save structure, 439, 455 
oversampled filter bank, 450 


Paige’s algorithm, 594 
parallel-to-serial conversion, 434, 471 
PARCOR, 696 
passivity relation, 748 
peak distortion criterion, 114 
peak-to-average ratio, 558 
Poisson process, see random process 
polyphase component, 127, 429 
positive-definite matrix, see matrix 
power distortion, 117 
power method, 386 
power normalization, 180 
power spectral density function, 415 
power spectrum, 323, 414 
power-of-two error LMS, see algorithm 
pre-whitening filter, 414 
pre-windowed data, 696 
predicted estimate, 732 
prediction, 209, 621 
backward, 622, 672 
forward, 624, 672 
order-recursive, 125 
Yule-Walker equations, 127 
Price’s theorem, 258, 286, 309, 398 
principal axis, 156 
probability density function, 1, 4 
projection, 240, 479 
backward, 518, 542, 658 
forward, 525, 543, 661 
matrix, 484 
prototype filter, 448 
pseudo-circulant matrix, 128, 430 
Pythagorean theorem, 534 


QR decomposition, 19, 583 

QR method, 538, 542 

QR array algorithm, 584, 598 
quadratic cost, 62, 705, 747 
quantization effects, 302, 600, 651, 703 


RAKE receiver, 122 
random process 
auto-correlation, 70, 323, 414 
auto-regressive, 304 
ergodic, 263, 477 
Poisson, 120 
random telegraph signal, 73, 120 
spectrum, 323, 414 
stationary, 70 
random telegraph signal, see random process 
random variable 
centered, 77 
Chi-square, 54 
circular Gaussian, 8, 47 
complex-valued, 4 
conditional expectation, 54 
covariance matrix, 39 
exponential, 56 
Gaussian, 1, 7, 38 


mean, 1, 5 
moments, 60 
Rayleigh, 2, 54, 55 
second-order statistics, 60, 116 
spherically invariant, 47 
standard deviation, 2 
variance, 1, 5 
vector-valued, 6 
random variables 
correlation coefficient, 41, 54 
independent, 4, 116 
orthogonal, 5, 32 
uncorrelated, 4, 5, 32, 116 
random-walk model, 271 
range space, see matrix 
rank, see matrix 
Rao-Blackwell theorem, 52 
rate of convergence, 151 
ray 
incident, 241 
refracted, 241 
Rayleigh fading, 302, 325, 558 
Rayleigh random variable, see random variable 
Rayleigh-Ritz ratio, 12, 129, 378 
Rayleigh distribution, see random variable 
recursive least-squares 
QR method, 584, 598 
block, 551-553 
downdating, 549 
exponentially-weighted, 498 
extended, 512 
fast array algorithm, 614 
finite-memory, 549 
inverse QR method, 581, 598 
minimum cost, 497 
rescue mechanism, 634 
sliding window, 549, 645 
square-root, 583 
stability issues, 538, 633 
updating, 493 
reduced constellation criterion, 214 
reflection coefficient, 126, 665, 674, 682 
refraction index, 241 
regressor, 81 
sufficiently exciting, 725, 747 
regularization, 162, 487, 495 
regularized prediction, 621 
rescue mechanism, 634, 640 
Riccati recursion, see Kalman filter, 112 
RLS algorithm, see algorithm 
robust adaptive filter, 705 
a posteriori, 718 
a priori, 726 
robustness, 243, 705 
energy conservation arguments, 735, 744 
LMS, 730, 735, 744 
min-max interpretation, 747 
NLMS, 724, 739 
passivity relation, 748 
RLS, 740, 754 
roundoff error, 302 


saddle point, 708, 709 
Sato algorithm, 207 
Schur complement, see matrix 
separation principle, 245
sequential processing, 697 
serial-to-parallel conversion, 433, 471 
shift register, 166, 189 
shift structure, 250, 251, 255, 669 
sign regressor LMS, see algorithm 
sign-sign LMS, see algorithm 
signature matrix, 613 
similar matrices, 418 
similarity transformation, 418, 468 
single error model, 539 
single-carrier-frequency-domain equalization, 113, 123, 558
single-path channel, 325 
singular value, see matrix 
singular value decomposition, see matrix 
singular vector, see matrix, 21
sliding window RLS, 549, 645 
small gain condition, 749 
small gain theorem, 745 
smoothing, 508 
Snell's law, 236, 240, 241 
SNR, 135 
arithmetic SNR, 116 
geometric SNR, 116 
space-time code, 120, 129 
spatial diversity, 121 
spatial sampling, 102 
spectral factorization, 414, 415 
spectral norm, 22 
spectral null, 225 
speech, 324 
speed of light, 326 
square-root factor, 562
square-root RLS, 581, 599 
stability, 235 
standard deviation, 2 
state vector, 81, 106 
state-space model, 106, 344, 590, 618 
stationarity, see random process 
stationary environment, 230, 329 
stationary point, 709, 716 
steady-state performance, 235, 237, 329 
steepest-descent, 223, 477 
algorithm, 142, 147 
leaky, 211 
learning curve, 154 
transient behavior, 148 
step-size, 143, 159 
stochastic approximation, 204 
stochastic gradient algorithm, 163, 229 
stop-and-go, see algorithm 
structured matrix, 466 
subband adaptive filter, 413, 447, 467 
closed-loop, 451 
open-loop, 451 
submultiplicative property, 22 
subspace methods, 114 
sufficiently exciting regressor, see regressor 
SVD, see singular value decomposition 
Sylvester's law of inertia, 16 
symbol error rate, 135, 137 
symbol recovery, 81 
symbol-spaced equalizer, see equalization
synthesis filter bank, 450
systolic array, 594 


tapped delay line, 363, 610 
tensor product, 23 
time constant, 153 
Toeplitz matrix, see matrix 
trace, see matrix 
tracking performance 
APA, 294, 316 
CMA, 294, 320, 321 
degree of nonstationarity, 275 
leaky-LMS, 294, 405 
LMF, 294, 315 
LMMN, 294, 315 
LMS, 280, 404 
NLMS, 284 
NLMS with power normalization, 294, 314 
RLS, 287 
separation principle, 281, 284, 315 
sign-error LMS, 285 
small step-size assumption, 280 
white Gaussian input, 282 
training, 174 
training sequence, 80 
transfer function analysis, 752 
transform-domain adaptive filter, 413, 466 
transient analysis, 235, 237, 329 
transient performance 
APA, 408 
data-normalized filter, 374 
leaky-LMS, 400 
LMS, 340, 357, 389, 410 
mean behavior, 343 
mean-square behavior, 343 
mean-square stability, 346 
NLMS, 371, 395 
sign regressor LMS, 398 
small step-size assumption, 362 
transmit diversity, 113 
triangle inequality, 22, 749 
triangular factorization, 16, 48, 64 
trigonometric transform, 447 


uniform mixing process, 365 
unitary matrix, see matrix 
universal prediction, 697 
U-shaped spectrum, 326 


variance, see random variable 
variance relation, 237, 277
weighted version, 333
vibration control, 751 
Volterra filter, 649 


WCDMA, 113
weight error, see error
weighted least-squares, 485, 542
Widrow-Hoff algorithm, see LMS algorithm
Wiener solution, 73, 112
Wiener filter, 112
wireless communications, 325


Yule-Walker equations, 127 


zero-forcing equalizer, 114 


